AN EVALUATION OF SPEECH COMPRESSION SYSTEMS

ABSTRACT

The results of PB word and nonsense syllable intelligibility tests,
voice quality, talker identification, and continuous speech tests
of selected speech compression systems are presented. The systems
were: a "reference" low-pass (approximately 3000 cps) filter system,
two channel vocoders, a semi-vocoder, a formant-tracking vocoder and
a multiple narrow band filter system. The status of various speech
compression techniques, current relevant research and recommendations
for future research and development in this area are reported.
Different speech compression techniques are classified according to
their ability to provide a given level of speech intelligibility at
different information rates.

It is judged that channel vocoders operating at about 2400 bits/sec
and semi-vocoders at an estimated 9600 bits/sec provide adequate
intelligibility and quality for most military communications; the
quality of the semi-vocoder is superior to that of the channel vocoder.
Formant-tracking vocoders utilize the lowest information rate (about
1000 bits/sec) of any of the bandwidth compression techniques.
Formant-tracking vocoders require further improvement before they can
be as satisfactory for general use.
1. INTRODUCTION

The aims set forth for Contract USAF 30(602)-2235, "An Evaluation
of Speech Compression Techniques," were:

(1) to determine the relative strengths and weaknesses of
presently available speech compression techniques,

(2) to evaluate these techniques as to possible future
potential and expansion,

(3) to determine the best method for equipment development
in the near future, and

(4) to determine the best areas for future intensive research
effort.

It became apparent early in the work on the contract that these
goals could not be met with confidence on the basis of existing
information and published reports. For example, although
steady progress has been made toward an understanding of the
processes of human speech generation and perception over the past few
years, the evaluation of speech compression procedures or techniques
remains largely an empirical question. The capability of a given
compression technique must be measured by an actual performance
test and cannot be determined solely through an assessment of the
technique in terms of some theory of speech. It also became obvious
during the investigation that published test results of the performance
of individual speech compression systems must be interpreted with
caution because of (a) the inherent unreliability of intelligibility
test scores; (b) the influence on the test scores of extensive
training of listeners for speech material processed by a particular
system; (c) the somewhat unnatural mode of presentation of speech
material in some performance tests; and (d) the frequent lack of
a common reference system, by means of which one may compare the
general capability of the talkers and the listening crew, and also
study the effects of various test materials.

* References are listed in the Appendix, Section 5.1.

Accordingly, an important part of the present project was a testing
program in which the performance of representative speech compression
systems was measured by various types of listening tests.
Following the testing program, an evaluation of the various speech
compression techniques was made, both with respect to their
suitability for immediate application and development and also with
respect to their ultimate capabilities, possibly after several years
of research and development. This evaluation was made partly on the
basis of the results of the testing program, partly on the basis of
published information on various speech compression systems, and
partly on the basis of existing knowledge of the acoustics of speech
and of the perception of speech.

The present report describes the various phases of the testing program
in Section 2 and gives an interpretation of the test results in
Section 3. A general evaluation of various speech compression
techniques is presented in Section 4, together with recommendations
concerning the present and future potential of the techniques.
Several types of research studies that may contribute to the future
development of new or improved speech compression techniques are also
discussed in Section 4.
2. PERFORMANCE TESTS OF SELECTED SPEECH COMPRESSION SYSTEMS

2.1 List of Systems Tested

The following speech compression systems were tested in the course
of the present study:*

(1) Reference (low-pass) system. This system consists of a
Spencer-Kennedy Model 302 Electronic Filter, set for 1500 cps
low-pass operation with a characteristic slope of -36 db/octave.
The reference system is sometimes designated in this report as
system R.

(2) Channel vocoder. Two systems were tested: Model H5f-2,
Philco Company, courtesy of the U. S. National Security Agency;
and Model HC-135, Hughes Aircraft Company, Communications
Division. Each vocoder has a 2400 bits/sec digital output
and an estimated 400 cps analog bandwidth. The Philco vocoder
is designated as system P; similarly, the Hughes vocoder is
designated as system H.

(3) Semi-vocoder. General Dynamics Corporation, Stromberg-
Carlson Division. This "base-band" vocoder has an estimated
analog bandwidth of 900 cps and is designated as system S.

(4) Formant vocoder. Melpar, Inc. The formant-tracking
vocoder tested produced a digital information stream of
1000 bits/sec; the bandwidth for analog operation is
approximately 140 cps. The Melpar vocoder is designated as system M.

(5) Spectrum sampling (narrow-band) system. Bolt Beranek and
Newman Inc. The analog bandwidth utilization is 800 cps.
This "narrow-band" system is designated here as system N.

(6) Tasaroff-Daguet system. Courtesy of U. S. Army Signal
Research and Development Agency. The estimated analog bandwidth
is approximately 1000 cps. The Tasaroff-Daguet system is
designated as system T.

* A brief, functional description of these systems (except the
Tasaroff-Daguet system) is given in Section 3 of this report.
For reasons of security classification, the Tasaroff-Daguet
system is described in a supplement to this report, Section 6.
The results for the Tasaroff-Daguet system are presented along
with results for the other systems in the body of the report, but
the interpretation and evaluation of the data for that system are
given in the supplement.

These systems were selected because they represent examples of
several different approaches to speech compression encompassing a
range of bandwidths or digital transmission rates, and also because
they happened to be available for testing.

Other methods for reducing the bandwidth required for transmitting
speech have, of course, been developed or proposed. Some of these
methods are discussed in Section 4 of this report. One technique
of particular interest is the pattern-correspondence scheme
reported by C. P. Smith (refs. 48, 49). This system was not
completely assembled at the time the present experiments were
carried out, and hence could not be tested. It is understood,
however, that tests comparing the pattern-correspondence scheme
with more conventional vocoder methods will be carried out by
C. P. Smith.
2.2 Summary of Tests

The speech compression systems listed above were subjected to
several types of tests in the course of the present study. These
tests were designed to measure:

(1) the intelligibility of phonetically balanced (PB) words,

(2) the intelligibility of nonsense syllables, with emphasis
on the confusions made,

(3) the general quality of the processed signal,

(4) the accuracy with which listeners can recognize a given
talker out of a small group of talkers, and

(5) the comprehension of continuous speech as a function of the
degree of noise interference.

The reasons for selecting these particular types of tests will be
discussed in the following sections. In general, however, the
objectives were to devise a group of tests that could be related
to other tests performed in the past, could provide some measure
of the ability of talkers and listeners to communicate through the
systems, and could give diagnostic information that would indicate
any basic limitations in a particular system or would suggest
modifications that may be made to improve system performance.

The general testing procedure was as follows: Various tests,
designed to measure the factors indicated above, were recorded under
laboratory conditions on magnetic tape. These test recordings were
then played back through a number of speech compression systems,
and the outputs of the devices were recorded on magnetic tape. The
recordings of the outputs of the various systems were ultimately
presented as tests to a crew of trained listeners under laboratory
conditions.*

* The listeners wore monaurally-fitted TDH-39 earphones made by the
Telephonics Company; the earphones were calibrated on a 6 cc coupler
and found to be flat within ±5 db from 100 to 7000 cps.

The prosecution of this test program was made possible through the
cooperation of the various industries and governmental agencies
responsible for the development and manufacture of the devices
tested. All play-backs of the different speech tests through the
various systems tested and all recordings of the outputs of these
systems were made under the supervision of personnel responsible
for the equipment, at the plant or laboratory where the equipment
was developed.

2.3 Intelligibility Tests Using PB Word Lists

The phonetically balanced (PB) word test material consists of 1000
common monosyllabic words divided into twenty 50-item lists. Each
list contains the different types of speech sounds with a frequency
of usage approximating that found in everyday American English. The
score obtained on each list should, therefore, be a generally valid
measure of the adequacy with which the communication system under
test can handle everyday speech. Since the PB word tests consist of
a large number of comparable lists, many conditions can be tested
in a single experiment without over-exposing the listeners to
particular words. PB word tests have been so widely used that a
standard has been prepared as a guide to their use by engineers.
These tests are typically scored in terms of the percentage of words
recorded by the listeners that are totally correct.
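The scoring rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pb_score`, the case-insensitive comparison, and the five example words are hypothetical and are not taken from the report or from the actual PB lists, which contain 50 items each.

```python
def pb_score(spoken, written):
    """Percentage of words that the listener wrote down entirely correctly.

    A response counts only if it matches the spoken word as a whole;
    no partial credit is given for a correct vowel or consonant.
    """
    if not spoken or len(spoken) != len(written):
        raise ValueError("need one written response per spoken word")
    correct = sum(
        s.strip().lower() == w.strip().lower()
        for s, w in zip(spoken, written)
    )
    return 100.0 * correct / len(spoken)


# Hypothetical five-word excerpt (a real PB list has 50 items).
spoken = ["bask", "cane", "dish", "fern", "hive"]
written = ["bask", "came", "dish", "fern", "five"]
score = pb_score(spoken, written)  # three of five words are correct
```

Averaging such per-list scores over listeners and over lists would give figures of the kind reported for each system in this section.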

Factors influencing intelligibility test scores. Scores for PB
word tests are influenced by several factors that often make it
difficult to compare the performance of speech communication systems
that have been tested in different experiments. It has been
demonstrated that the number of PB word lists actually used in testing a
system has a significant effect upon the score obtained for that
system. For example, the score may be higher by as much as 20
percentage points if the listeners have been exposed to only four
50-word lists (or a total of 200 different items) than if all twenty
50-word lists (or 1000 different items) had been used. Because
intelligibility tests reported in the literature indicate the use of
different numbers of word lists, comparisons of the results of these
experiments can be made only with considerable caution, if at all.

Another difficulty in the interpretation of intelligibility test
results is that the scores tend to get progressively higher as the
listeners gain more experience with a given system. This is
particularly true for speech compression systems whose outputs have a
peculiar sound quality. Under these conditions, with day-after-day
training on a limited set of words and talkers, it is possible to
obtain intelligibility scores that give an erroneous impression of
the performance of the system for naive listeners.

There are two principal types of learning that seem to take place in
a speech intelligibility test program. The first and most obvious is
the learning of the voices of the particular talkers involved and of
the speech material itself. In our experiments we attempted to
overcome this type of learning by presenting to our listening crew
for several weeks (three 3-hour sessions per week) special scramblings
of our PB word and nonsense syllable tests. The reference system, a
simple low-pass filter placed in an otherwise high-fidelity
transmission link, was used for these tests, although a few tests were
also given with a channel vocoder.

The second type of learning becomes evident as listeners become
more and more familiar with the characteristics imposed upon
different speech sounds by a particular system. Initially, the
listening crew used in these tests was generally unfamiliar with the speech
compression systems that we wished to evaluate. Although the amount
of experience and listening afforded each system was approximately
equal in the test evaluation program, the reference system, having
also been used for training purposes, was presented more often.
The scores obtained for the reference system probably benefited
by an extra amount from both types of learning outlined above.

The effects of extensive experience with the reference system are
illustrated in Fig. 2.3-1. The scores recorded during the initial
training period and during the course of the experiment are shown
for two signal-to-noise ratio conditions. From this figure we would
estimate that the reference system scores for the experiment proper
are probably 5 to 10 percentage points higher than they should be,
relative to the scores obtained for the other speech compression
systems, because of excessive exposure.
Design of the present test program. The recorded tests for each
communication system to be evaluated were arranged to provide four
sub-experiments. Each sub-experiment was designed so that the order
of presentation of the tests obtained from the various talkers was
randomly distributed among the systems. Through this procedure
we feel that the effects of learning and order of presentation were
approximately equally distributed among the various speech compression
systems and that any one system was not favored by its position in
the testing sequence.
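One way to picture this kind of randomization is the Python sketch below. It is not the procedure actually used in the study, whose details are not given here; the function name, the fixed seed, and the choice to shuffle whole systems within each sub-experiment are assumptions made for illustration. The single-letter labels follow Section 2.1, with P assumed for the Philco vocoder.

```python
import random

# System labels from Section 2.1 ("P" for Philco is an assumption).
SYSTEMS = ["R", "P", "H", "S", "M", "N", "T"]


def presentation_orders(systems, n_sub_experiments=4, seed=0):
    """Return one independently shuffled system order per sub-experiment.

    Shuffling separately for every sub-experiment keeps any single
    system from always occupying the same position in the sequence.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    orders = []
    for _ in range(n_sub_experiments):
        order = list(systems)
        rng.shuffle(order)
        orders.append(order)
    return orders


orders = presentation_orders(SYSTEMS)
```

Each entry of `orders` is a permutation of the seven systems, so every system is heard once per sub-experiment while its position varies from one sub-experiment to the next.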


The PB word tests that were prepared for determining the
intelligibility of the speech compression systems may be conveniently
divided into four groups:

                                                    No. of Lists
Group    Talker      S/N Ratio    Microphone*       per System

I        Male 1      Optimal      Dynamic                4
         Male 2      Optimal      Dynamic                4
II       Female 1    Optimal      Dynamic                4
         Female 2    Optimal      Dynamic                4
III      Male 1      15 db        Dynamic                4
IV       Male 1      Optimal      Carbon                 2
         Male 2      Optimal      Carbon                 2

* The dynamic microphone used for making the test recordings was
an Altec-Lansing Model 661A. The talker read the words in a
soundproofed, semi-anechoic room, with the microphone positioned
approximately 10 inches from his lips. The carbon microphone was
that of a standard telephone handset, Western Electric Model 500.
The talker held the handset in a normal position with the mouthpiece
near his lips. All test recordings and system output recordings
were made on either an Ampex Model 350 or Model 600 tape recorder
operating at 7-1/2 i.p.s.



Report  No.  91^ 


Bolt  Beranek  and  Newman  Inc. 


In groups I and II each talker recorded a different version of all
twenty 50-word lists. A set of four of these recordings from each
talker was then selected for processing by one of the seven speech
compression systems. Other sets of four recordings were selected
for each of the remaining systems, avoiding duplicate choices as
much as possible. In group III the talker recorded another version
of all twenty lists, this time against a constant background of
filtered white noise. As before, a set of four recordings was
reserved for each system. Complete versions of all twenty lists
were not recorded in group IV, but a set of two recordings from each
talker was chosen for processing by each of the four systems tested.

In addition to the tests described in the above four groups, two
PB word lists were recorded by male talker No. 1 at Melpar, Inc.,
using their microphone facilities. While the recording was being
made, the lists were processed "live" by the Melpar system; i.e.,
the microphone signal was recorded and simultaneously sent through
the system. The recording of the microphone signal was then played
back and processed by the Melpar system. Later, this same recording
of the Melpar microphone signal was also processed by the Hughes
system. On another occasion, male talker No. 1 read two different
PB word lists live through the Philco system, using the microphone
normally used with that system.

The Melpar and Philco systems were tested live in order to resolve
a question which had been raised by some industries responsible
for the development of speech compression devices regarding the
comparability of recorded tests and live tests. The question may
be considered to consist of three parts:

1. The inherent background noise and frequency distortion of
the tape recording process could possibly reduce the performance
of a sensitive system and thus degrade intelligibility scores;

2. The microphone used in making the tape recordings may have a
frequency response that is significantly different from the
response of the microphone for which a particular system is
designed;

3. There may be a degradation in intelligibility because the
operator of some systems normally holds the microphone within
an inch or two of his lips, whereas the tape recordings were
made with the microphone about 10 inches from the talker's lips.

Group I: Male talkers in quiet. The results of the PB word tests
from Group I, averaged over the four sub-experiments and over a crew
of eight listeners, are shown in Fig. 2.3-2 by the solid dots. The
spread of the scores (averaged only over the listeners) within the
sub-experiments is also indicated. This figure also presents the
average scores obtained for the tests which were master recorded at
Melpar, Inc., and for the tests recorded live with the Melpar and
Philco systems. These scores are shown by the open circles.

Figure 2.3-2 shows that the intelligibility of the Melpar system
improved through the use of a close-talking microphone, but also that
the system was not adversely affected by using a tape recording as an
input instead of a direct microphone. The Philco system is apparently
less sensitive to the distance at which the microphone is used, and
very similar scores are obtained whether the system is tested by means
of our regular tape recordings or live with the close-talking Philco
microphone. The overall signal-to-noise ratio was measured during
the recording of all master test tapes and was found to be in excess
of 38 db. In contrast, the signal-to-noise ratio measured at the
input of several systems that were operated live hardly approached
and never exceeded this value. We therefore conclude that the
recording process used does not have detrimental effects on the
operation of the speech compression devices that were tested.

On the basis of the present test results it appears that the Melpar
formant vocoder and the Hughes channel vocoder perform better when
operated with a close-talking microphone instead of a microphone
that is used at some distance from the speaker. A possible
explanation for this is that some systems are more sensitive than others to
the subtle spectrum differences between speech waves picked up close
to the mouth (near-field condition) and those picked up at a point
remote from the mouth (far-field condition). Also, it is reasonable
to expect slight differences in the performance of a given system
depending on the characteristics of the dynamic microphone being used.

Table 2.3-1 indicates which differences between the mean scores
obtained for the seven rank-ordered systems are statistically
significant. These data are based on an analysis of variance of test
score distributions (see Table 5.2-1) and on the application of t
tests (see Table 5.2-2).
Table 2.3-1

Differences Between Scores Obtained for Group I:
Male Talkers in Quiet
(8 Subjects)

                                             Significant* Difference
System             Rank   Average PB Score   from Next-Ranked System

Reference            1          95                    Yes
Stromberg            2          86                    No
Philco               3          85                    Yes
Tasaroff-Daguet      4          79                    Yes
Narrow Band          5          68                    Yes
Hughes               6          61                    Yes
Melpar               7          33

(*Statistically significant at the p ≤ 0.01 level of confidence.)
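To make the idea of a t test between two systems concrete, the sketch below computes a paired t statistic over a crew of eight listeners. It is an illustration only: the per-listener scores are invented, and the paired form of the test is an assumption, since the exact form of the t tests is given in the Appendix tables rather than in this section. The critical value 3.499 is the standard two-sided 0.01 value for 7 degrees of freedom.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired t statistic for two systems scored by the same listeners."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need matched scores from at least two listeners")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(len(diffs)))


# Two-sided critical value of t for 7 degrees of freedom at p = 0.01.
T_CRIT_DF7_P01 = 3.499

# Hypothetical PB scores (percent) for eight listeners on two systems.
system_x = [86, 88, 84, 90, 85, 87, 83, 89]
system_y = [80, 79, 78, 83, 77, 82, 76, 81]

t = paired_t(system_x, system_y)
significant = abs(t) > T_CRIT_DF7_P01
```

With these invented scores the mean difference is 7 percentage points and the statistic is far above the critical value, so the difference would be marked "Yes" in a table of this kind.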
Group II: Female talkers. The results of the tests from Group II,
again averaged over the sub-experiments and over the listeners, are
given in Fig. 2.3-3. The Melpar and Tasaroff-Daguet systems were
not tested for Group II. The spread of the scores over the
sub-experiments is observed to be generally greater here than for the
corresponding tests featuring male talkers.

Table 2.3-2 shows that the scores for the female talkers fall off
more sharply for the channel vocoders and the semi-vocoder than for
the reference and narrow-band systems. In the case of the channel
vocoders this may be attributed to a difficulty in properly tracking
the higher fundamental frequency of the female voices. In the
case of the semi-vocoder the reduced intelligibility with female
talkers may be explained in terms of the spectral location of the
base-band. Fewer harmonics of the female voice are encompassed in
this band, and hence it is more difficult to generate an excitation
signal with a relatively uniform spectrum.


Table 2.3-2

Average PB Word Scores for Male and Female Talkers

System          Male    Female    Difference

Reference        95       83          12
Narrow-Band      68       52          16
Philco           85       59          26
Hughes           61       34          27
Stromberg        86       56          30
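The arithmetic behind Table 2.3-2 is simple enough to check directly. The Python sketch below uses the male and female averages from the table; the variable names are illustrative only.

```python
# (male, female) average PB word scores from Table 2.3-2, in percent.
scores = {
    "Reference": (95, 83),
    "Narrow-Band": (68, 52),
    "Philco": (85, 59),
    "Hughes": (61, 34),
    "Stromberg": (86, 56),
}

# Drop in score, in percentage points, from male to female talkers.
drops = {name: male - female for name, (male, female) in scores.items()}

# Systems ordered from the smallest to the largest drop.
by_drop = sorted(drops, key=drops.get)

# Mean of the male and female scores for each system.
combined = {name: (male + female) / 2 for name, (male, female) in scores.items()}
```

Ordering by `drops` reproduces the row order of the table, with the reference and narrow-band systems losing the least and the vocoders the most. The values in `combined` agree, to within rounding, with the combined averages listed in Table 2.3-3.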
Table 2.3-3 indicates which differences between the mean scores
obtained for the five rank-ordered systems are statistically
significant. The data are based on an analysis of variance of test
score distributions (see Table 5.2-3) and on the application of t
tests (see Table 5.2-4). The only two systems that do not score, on
the average, significantly differently from each other are the
Stromberg semi-vocoder and the Philco channel vocoder.


Table 2.3-3

Differences Between Scores Obtained for Groups I and II:
Male and Female Talkers in Quiet
(8 Subjects)

                                         Significant* Difference
System         Rank   Average PB Score   from Next-Ranked System

Reference        1          89                    Yes
Philco           2          72                    No
Stromberg        3          71                    Yes
Narrow-Band      4          60                    Yes
Hughes           5          47

(*Statistically significant at the p ≤ 0.01 level of confidence.)


From these statistical investigations it is evident that for the present
PB word data a difference of 4 percentage points or more is significant.
This is in general agreement with previous studies of PB word
intelligibility tests, where it has been found that when averaged over 100
to 200 PB words (2 to 4 50-word lists) differences of 5 or more
percentage points within a given experiment prove to be statistically
significant.
Group III: Male talkers in noise. A series of PB word tests was
recorded with noise mixed in electrically at the input to the
recorder. The noise was obtained from a white noise generator and
was filtered to have a spectrum similar to the long-term average
speech spectrum (ref. 12). The signal-to-noise ratio was 15 db, as
measured on a standard true RMS voltmeter set on "slow" meter action.
The level of the speech was taken as the decibel average of the speech
levels measured on each word in one PB word list. These recordings
were made in order to demonstrate how the communication systems
under test would perform if the talker had been in a moderate amount
of ambient noise.
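The level calculation described above can be sketched as follows. The sketch assumes that "decibel average" means the arithmetic mean of the per-word meter readings in decibels, and the readings themselves are invented values relative to an arbitrary meter reference; neither detail is confirmed by the report.

```python
def decibel_average(levels_db):
    """Arithmetic mean of per-word levels that are already in decibels."""
    if not levels_db:
        raise ValueError("need at least one level reading")
    return sum(levels_db) / len(levels_db)


def noise_level_for_snr(word_levels_db, snr_db=15.0):
    """Noise level (db) that yields the requested signal-to-noise ratio."""
    return decibel_average(word_levels_db) - snr_db


# Hypothetical meter readings for five words, in db re an arbitrary reference.
word_levels = [-12.0, -10.5, -11.0, -13.5, -9.0]

speech_db = decibel_average(word_levels)       # -11.2 db
noise_db = noise_level_for_snr(word_levels)    # 15 db below the speech level
```

Averaging the decibel readings rather than the underlying powers gives less weight to the few loudest words, which may be one reason for defining the speech level this way.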

The results obtained with these test recordings are shown in Fig.
2.3-4. For comparison purposes, the average scores obtained for
the quiet condition are also indicated in the figure.

The noise has a slight depressing effect upon the performance of
the reference, narrow-band and Tasaroff-Daguet systems, and a
drastic and harmful effect upon the performance of the Stromberg
semi-vocoder, the Philco channel vocoder and the Melpar formant
vocoder. On the other hand, the noise improved the performance of
the Hughes channel vocoder. This finding was also borne out in the
relative comprehension test, the results of which are given in
Section 2.7. Comparative listening to the Hughes vocoder when
operated in the quiet and in ambient noise produces the impression
that the noise tends to stabilize the performance of the pitch
extractor of the instrument. However, since the vocoder was operated
in its digital mode, it is also possible that some misalignment in
the digitizer circuits is responsible for this unexpected result.
Group IV: Male talkers in quiet, carbon microphone. Figure 2.3-5
shows the results obtained for male talkers when using a
telephone-type carbon microphone. Also shown in the figure are average results
obtained with the dynamic microphone. It is seen that the carbon
microphone lowers the intelligibility scores by 4 to 10 percentage
points. In view of the typically poorer frequency response of a
carbon microphone in comparison to a dynamic microphone, such a
reduction in scores is to be expected; that the scores are no lower
than is indicated here is perhaps surprising.

2.4  Intelligibility  Tests  Using  Nonsense  Syllables 

In the evaluation of speech compression systems we would often
like to know in detail the performance of a system for various
phonemes and classes of phonemes and for various distinctive features
of the phonemes. This type of information may frequently help the
experimenter to isolate the portion of a system that is responsible
for a defect in performance and may lead to the design of suitable
corrective measures. The information may also help to indicate
any fundamental limitations in a particular speech compression
technique. It may suggest, for example, that a particular technique
is inherently incapable of making distinctions in the acoustic
signal that are necessary cues for the identification of a particular
distinctive feature of a phoneme.
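Diagnostic information of this kind is commonly organized as a table of confusions, in which each spoken phoneme is tallied against what the listener reported. The Python sketch below shows one minimal way to build such a tally; the function names and the seven stimulus-response pairs are hypothetical and are not data from this study.

```python
from collections import Counter


def confusion_counts(pairs):
    """Tally (spoken, heard) pairs into a confusion table."""
    return Counter(pairs)


def percent_correct(counts, phoneme):
    """Percentage of presentations of `phoneme` that were heard correctly."""
    total = sum(n for (spoken, _), n in counts.items() if spoken == phoneme)
    if total == 0:
        raise ValueError(f"phoneme {phoneme!r} was never presented")
    return 100.0 * counts[(phoneme, phoneme)] / total


# Hypothetical (spoken, heard) responses for three voiceless stops.
responses = [
    ("p", "p"), ("p", "t"), ("p", "p"), ("p", "p"),
    ("t", "t"), ("t", "k"),
    ("k", "k"),
]

counts = confusion_counts(responses)
p_score = percent_correct(counts, "p")  # 3 of 4 presentations correct
```

Off-diagonal entries such as `counts[("p", "t")]` show which distinctions a system fails to preserve, which is the kind of evidence that could point to a specific defect in a compression technique.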

In the type of intelligibility test that is generally used for
diagnostic purposes the test material consists of nonsense syllables.
The nonsense material is usually monosyllabic, and consists of
different vowels preceded and/or followed by a variety of consonants
and consonant clusters. The number of consonants and consonant
clusters that are used in American English is quite large (25-odd
consonants and an even greater number of clusters), and consequently
a large number of syllables must be used if all situations are to
be studied. In order to avoid an inordinate amount of testing time,
some compromise is usually made and a shorter list of syllables is
used. The total list of syllables is still, however, very long.
Furthermore, highly trained listeners are required in this type of
test, and the listeners must learn a long list of phonetic symbols
or their equivalent.

FIG. 2.3-5 AVERAGE SCORES ON PB WORD TESTS, MALE TALKERS IN QUIET (GROUP IV)

As part of the present program of evaluating various speech
compression systems, a group of new nonsense syllable tests has been
developed. The approach that has been used represents one possible
compromise to the problem of formulating suitable tests for
diagnosing certain aspects of the performance of speech compression
systems. Some of the tests are intended for the study of consonant
intelligibility only; other tests evaluate vowel intelligibility. Each
test in the group is quite short, and is designed to evaluate the
performance of a speech communication link for only one or two
consonant or vowel features. A large number of separate tests are
therefore required to test a significant number of different vowel
and consonant sounds; however, because each test is short and because
each response must be selected from only a small set of
possible responses, the listeners find the tests to be relatively
easy to take, and the scores stabilize with very little training.

Consonant tests. A list of the consonants tested is shown in Table
2.4-1. Most, but not all, of the consonants of American English
are included.
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Table  2.4-1 

Listing of Consonants Used in the Nonsense Syllable Tests,
Arranged According to Place of Production (columns)
and Manner of Production (rows)

                             A               B              C
                             Bilabial,       Post-dental    Alveolar,
                             labio-dental                   velar

1. Voiceless stops           p               t              k
2. Voiced stops              b               d              g
3. Voiceless fricatives      f               θ (thin), s    ʃ (shoe)
4. Voiced fricatives         v               z              ʒ
5. Nasals                    m               n              ŋ (sing)
6. Glides                    w (win)                        j (yes)
7. Liquids                                   l              r

Some consonant clusters are tested in addition to the single
consonants listed. The consonants in this table are arranged in a
way that indicates roughly the manner of production according to
rows, and the place of production according to columns. Thus
Column A lists the bilabial and labio-dental consonants, all of
which are produced with a vocal-tract constriction at the anterior
end of the vocal tract. Column B lists the consonants that are
produced with a constriction immediately behind the teeth, and
Column C lists those that are produced with a constriction that is
further back in the vocal tract, i.e., the alveolar and velar
consonants.


Report No. 914


Bolt  Beranek  and  Newman  Inc. 


The voiceless and voiced stop consonants are shown in rows 1 and
2 of the table, while rows 3 and 4 list the voiceless and voiced
fricatives. The group /m n ŋ/ in row 5 are nasals, /w j/ in row 6
are called glides, and /l r/ in row 7 are called liquids.

The list of utterances that constitute each of the consonant tests
contains several versions of each of a relatively small number of
consonants (four to eight) occurring in initial and/or in final
positions in syllables. Each list is further simplified by
including only two possible syllabic nuclei or vowels. One of the
vowels is always a long vowel and the other is short; one is a
back vowel and the other is a front vowel. Four lists of syllables,
differing only in the pair of vowels used, are assembled for any
one group of consonants. The vowels are always selected from the
set /i ɪ ɛ æ ɑ ʌ ʊ u/, and four different pairs of these vowels are
selected to assemble test lists for each consonant group. Within
any one list a given consonant generally appears once in initial
position and once in final position with any one vowel.

In all, nine groups of consonants were tested, and hence the total
number of syllable lists was 36. The nine groups of consonants
are listed in Table 2.4-2. In this table, tests 1a, 1b, 1c and 1d,
for example, are the four tests that examine the six consonants
/p t k b d g/; the vowel pairs for this set of tests were,
respectively, /i ʌ/, /æ ʊ/, /ɪ ɑ/ and /ɛ u/. The final column in the
table indicates for each row the particular features that are tested
by examining responses to the group of consonants in that row.

The test syllables in all cases except the final clusters are
preceded by an unstressed carrier syllable /hə/. Thus, a typical item
in test 1a would be /hə pʌd/. In the case of the tests of final
consonant clusters, monosyllabic utterances were used, preceded by
the consonant /h/. Thus, a typical item in the group of tests 8a-8d
would be /hɛnt/.


Vowel tests. Eight different vowels were used in these tests:
the vowels /i ɪ ɛ æ ɑ ʌ ʊ u/. The tests were divided into four
groups of two tests each, for a total of eight tests; four vowels
were included in each of the groups. A listing of the vowels for
each group is given in Table 2.4-3.

The structure of each test item was similar to that of the items for
most of the consonant tests, i.e., each utterance consisted of the
unstressed carrier syllable /hə/ followed by a stressed
consonant-vowel-consonant syllable. The initial and final consonants
were identical for a given test item. For the "a" series of tests
(i.e., 10a, 11a, 12a and 13a) the consonant environments were selected
from the voiceless consonants /p t f s/; the voiced consonants
/b d v z/ formed the environments for the "b" series of tests. Thus a
typical utterance in test 10a, for example, was /hə fif/. In each
test, each vowel occurred once in each consonant environment, so that
there were 16 test syllables in all.

Table 2.4-2

Groups of Consonants Contained in Individual
Nonsense Syllable Tests

Tests     Consonants                     Features Studied

1a-1d     p t k b d g                    Voiced-voiceless for stop
                                         consonants; place for stop
                                         consonants

2a-2d     p t k f s ʃ                    Interrupted-continuant for
                                         voiceless consonants; place
                                         for voiceless consonants

3a-3d     b d g v z ʒ                    Interrupted-continuant for
                                         voiced consonants; place for
                                         voiced consonants

4a-4d     f s ʃ v z ʒ                    Voiced-voiceless for fricative
                                         consonants; place for
                                         fricative consonants

5a-5d     b d v z m n                    Manner for voiced consonants;
                                         place for voiced consonants

6a-6d     f θ s ʃ                        Place for voiceless fricatives

7a-7d     w j m n r l (initial)          Manner for voiced consonants;
          m n ŋ r l (final)              place for voiced consonants

8a-8d     s t n l r rt st lt nt          Final clusters

9a-9d     s sp st sw sl sm sn str        Initial clusters

Table 2.4-3

Vowels Used in Each of the Four Groups of Vowel Tests

Test          Vowels        Description of Vowel Series

10a, 10b      i ɪ ɛ æ       Front vowels
11a, 11b      ɑ ʌ ʊ u       Back vowels
12a, 12b      i æ ɑ u       Long vowels
13a, 13b      ɪ ɛ ʌ ʊ       Short vowels

Phonetic symbols: i (beat), ɪ (bit), ɛ (bet), æ (bat),
ɑ (father), ʌ (but), ʊ (foot), u (boot)

Preparation and administration of tests. As noted above, the number
of test items in each test is determined by the number of consonants
and vowels that are included in the test. Thus, for example, tests
1a-1d, which each include 6 consonants in 2 vowel environments, have
a total of 12 test items each; each vowel test includes 4 vowels in
4 consonant environments, giving a total of 16 test items. In the
preparation of the test lists, three additional nonsense syllables
were added to each list of test items in order to destroy the
apparent symmetry of the tests. These "dummy" items were distributed
randomly throughout each list and were not included in the scores for
the test.
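The item counts quoted above can be checked with a short sketch. The function names, the seeded shuffle, and the placeholder item labels are illustrative assumptions; the report states only the counts and that the dummy items were placed at random and left unscored.

```python
import random

# Figures quoted in the text.
CONSONANT_GROUPS = 9      # rows of Table 2.4-2
LISTS_PER_GROUP = 4       # one list per vowel pair (a-d)
VOWEL_GROUPS = 4          # rows of Table 2.4-3
SERIES_PER_GROUP = 2      # "a" (voiceless) and "b" (voiced) environments
DUMMY_ITEMS = 3           # unscored items added to each list


def scored_item_count(n_targets, n_environments):
    """Each target sound appears once in each environment."""
    return n_targets * n_environments


def build_test_list(scored, dummies, seed=0):
    """Return (item, is_scored) pairs in a reproducible random order.

    The seeded shuffle is an assumption; the report says only that the
    dummy items were distributed randomly throughout each list.
    """
    rng = random.Random(seed)
    items = [(s, True) for s in scored] + [(d, False) for d in dummies]
    rng.shuffle(items)
    return items


def score(responses, test_list):
    """Percent correct over scored items only; dummy items are ignored."""
    scored = [(item, resp)
              for (item, is_scored), resp in zip(test_list, responses)
              if is_scored]
    correct = sum(1 for item, resp in scored if item == resp)
    return 100.0 * correct / len(scored)


consonant_items = scored_item_count(6, 2)   # tests 1a-1d
vowel_items = scored_item_count(4, 4)       # each vowel test
total_tests = (CONSONANT_GROUPS * LISTS_PER_GROUP
               + VOWEL_GROUPS * SERIES_PER_GROUP)
```

With these figures a consonant list for tests 1a-1d holds 12 scored items plus 3 dummies, a vowel list holds 16 plus 3, and the whole battery comes to 36 consonant lists and 8 vowel lists.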

Each of the 44 tests was recorded by two talkers, different random
orders being used for each talker. The talkers were trained in
generating these types of utterances, and were instructed to read
the items with constant voice effort (rather than using a VU-meter
to monitor the level) and, as far as possible, with the same
inflection for each item. All recordings were later monitored by both
talkers, and, in the case of utterances that were judged to be
unacceptable, new recordings were made. These recordings were processed
by the various speech compression systems that were being evaluated,
except for the Tasaroff-Daguet system. A more restricted set of tests
was processed by the Tasaroff-Daguet system, as discussed below.

In the administration of the tests, the listeners were provided with
answer sheets of the type shown in Figs. 2.4-1 (typical consonant
test) and 2.4-2 (typical vowel test). For a given consonant test,
the possible consonant responses are indicated at the top of the sheet.
For each test item, the vowel is given, and blanks indicate where the
consonant responses are to be entered. In the case of the vowel tests,
the possible vowel responses are indicated at the top of the sheet and
the consonant environment is given for each syllable.

The listening crew was given several training sessions for nonsense
syllables before the test material from the various speech compression
systems was administered. The first few training sessions indicated
some improvement in the overall performance of the subjects, but the
scores reached relatively stable values before the tests proper were
begun. For the vowel tests, phonetic symbols were used to indicate
responses, but only about two hours of training were required before
these symbols were learned adequately. The problem of learning the
phonetic symbols is, of course, minimized when only four possible
responses are required in a given test.

The same crew of eight listeners was used for both the PB word and
the nonsense syllable tests. Suitable precautions were taken to
balance out any learning that might have occurred throughout the
test series, as discussed in Section 2.3 for the PB word tests.

Nonsense syllable tests used with the Tasaroff-Daguet system. In the
overall testing program, it was necessary to process the speech
recordings by the Tasaroff-Daguet system several months before the
other systems were tested. At the time the Tasaroff-Daguet tests
were made, the complete set of nonsense syllable tests described
above was not available, and it was therefore necessary to use a
more restricted group of recorded tests. These tests had been
recorded by the same talkers who recorded the subsequent more
extensive series of tests. The restricted group of tests was
designed to evaluate consonant intelligibility only; no vowel tests
were available for processing by the Tasaroff-Daguet system. These
materials included only one or two tests for each consonant group
(rather than four, as in the extensive tests), and hence not all
vowel environments were included. Also, the particular consonants
included in two of the tests were slightly different from those in
Table 2.4-2.

Results. The results of the nonsense syllable tests are presented
in Tables 2.4-4 through 2.4-10 as a series of confusion matrices.
These matrices, in which the entries represent percentages, summarize
results pooled for the two talkers and for all listeners, and
combined for initial and final consonant positions where applicable.
Twenty-four confusion matrices are given for each system (except
the Tasaroff-Daguet system). The test data that were pooled to
obtain these confusion matrices are listed in Table 2.4-11.

The number of individual responses on which an entry in a confusion
matrix is based varies from 144 for the vowel tests 10-13 to over
1500 for the voiced-voiceless distinction.

The data from the confusion matrices in Tables 2.4-4 through 2.4-10
are still further collapsed in Table 2.4-12, which lists the average
percentage error for each confusion. Each entry in this table, when
divided by 100, can be interpreted as the probability that a stimulus
is categorized incorrectly, i.e., the probability that a response
occurs as any off-diagonal entry in the relevant confusion matrix.
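The reduction from a confusion matrix to a single error figure can be sketched as follows. This is a minimal illustration, not the report's procedure: it assumes a matrix of raw response counts (rows are stimuli, columns are responses) and pools all rows, which equals the average of the per-stimulus error rates only when every stimulus is presented equally often. The example counts are hypothetical.

```python
def percent_error(confusion):
    """Percentage of responses that fall off the main diagonal.

    `confusion[i][j]` is the number of times stimulus i was answered
    as response j (counts, not percentages).
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no responses")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * (total - correct) / total


# Hypothetical counts for a three-way place distinction, 100
# presentations per stimulus; these are NOT data from the report.
example = [
    [90, 6, 4],
    [5, 88, 7],
    [3, 9, 88],
]
example_error = percent_error(example)
```

Dividing the result by 100 gives the probability of an incorrect categorization in the sense used for Table 2.4-12.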

The results for the Tasaroff-Daguet system are derived from a
restricted series of tests, and cannot be compared directly with data
for the other systems.

Table 2.4-13 shows the result of averaging data from groups of
features. Some of these averaged data are plotted in the various
portions of Fig. 2.4-3. The PB word scores for male voices in quiet,
previously shown in Fig. 2.3-2, are replotted for comparison with the
various nonsense syllable scores.
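The averaging that produces Table 2.4-13 is a plain mean over rows of Table 2.4-12. The sketch below reproduces three entries for the reference system (column R); the dictionary layout and names are illustrative assumptions, and only a few rows are transcribed.

```python
# Column R of Table 2.4-12 (probability of error times 100),
# keyed by row number. Only the rows needed here are included.
COLUMN_R = {2: 7.7, 3: 6.4, 4: 7.4, 5: 4.2, 6: 0.0, 9: 14.8, 12: 12.9}

# Rows of Table 2.4-12 that are pooled for each feature.
FEATURE_ROWS = {
    "interrupted-continuant": (2, 3),
    "other manner": (4, 5, 6),
    "stops": (9, 12),
}


def feature_average(column, rows):
    """Unweighted mean of the listed rows for one system."""
    return sum(column[r] for r in rows) / len(rows)


averages = {name: feature_average(COLUMN_R, rows)
            for name, rows in FEATURE_ROWS.items()}
```

The computed means (7.05, 3.87 and 13.85) agree with the tabulated 7.0, 3.9 and 13.8 to within the one-decimal rounding used in the table.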


FIG. 2.4-3 AVERAGE PERCENT ERRORS FOR DIFFERENT CONSONANT FEATURES,
AS DERIVED FROM RESULTS OF NONSENSE SYLLABLE TESTS.
(PANELS: MANNER, PLACE)

Rationale for interpretation of nonsense syllable tests. The
interpretation of the data from the nonsense syllable tests may be
facilitated by a brief review of the primary acoustic features that
seem to signal the various distinctions among the vowels and
consonants of natural speech. A given speech compression technique
may introduce distortion of some of these features but may leave
others relatively unchanged. Thus, in order to interpret the
confusion matrices that are obtained for a given system, it is
necessary to know the temporal and spectral properties of the
speech sounds that form the major cues for identification of the
sounds.

The types of acoustic features that signal place of production for
a consonant depend greatly on the manner of production, i.e., on
the row in which the consonant appears in Table 2.4-1. The cues
for place of production of some of the consonants are carried by the
properties of the transitions to and from the adjacent vowels. It
is known, for example, that the voiced stops /b d g/ are
distinguished from each other largely on the basis of the transitions
of the formants in the adjacent vowel, particularly the second
formant. The same is true of the nasal consonants, the two voiceless
fricatives /f θ/ and, to some extent, the voiceless stops. Some of
the consonants, however, such as the voiceless fricatives /f s ʃ/,
the voiced fricatives /v z ʒ/, and to some extent the voiceless
stops, are undoubtedly identified on the basis of their spectral
characteristics during intervals when the short-time spectrum is
relatively stationary. The liquids and glides are characterized by
changing formants, although there are apparently certain approximate
target positions for the first two or three formants of these
sounds.⁴² The glide /w/, for example, is characterized by two
low-frequency resonances (roughly 250 and 700 cps for male voices),
while /j/ is characterized by a low-frequency resonance (about
250 cps) and two closely-spaced resonances at high frequencies
(around 2500-3000 cps).

The identification of a vowel is known to depend primarily on the
frequency locations F1 and F2 of the lowest two vocal-tract
resonances or formants. These formants are manifested in the acoustic
spectrum by spectral peaks whose frequency locations and relative
amplitudes depend upon the frequencies of the formants. Approximate
average values of the first two formant frequencies at centrally
located points in the vowels for the talkers who recorded the
nonsense syllable tests are given in Table 2.4-14. From this
table it is observed that vowels in the front vowel series /i ɪ ɛ æ/
are characterized by successively increasing values of F1 and
successively decreasing values of F2. For the back vowel series
/ɑ ʌ ʊ u/ the first two formants are relatively close together and
the frequency of the first formant progressively decreases. In both
back and front vowel series, vowel durations as well as formant
frequencies can provide cues to identification of the vowel.

The long vowels /i æ ɑ u/ and the short vowels /ɪ ɛ ʌ ʊ/ can be
grouped into the pairs /i u/, /æ ɑ/, /ɪ ʊ/, and /ɛ ʌ/. Within each
pair the frequency of the first formant is about the same, and there
is a difference only in the frequency of the second formant. Thus,
any processing that results in distortion or attenuation of the
spectrum in the range of the second formant frequency may lead to
confusions within these pairs of vowels.
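The pairing argument can be checked against the formant values of Table 2.4-14. In the sketch below the vowels are keyed by their example words rather than by phonetic symbols, and the second-formant value for the vowel of "bit" is read as 1710 cps; the printed figure is partly illegible, so that value is an assumption.

```python
# (F1, F2) in cps from Table 2.4-14, keyed by example word.
# The F2 of "bit" is read as 1710 (assumed; the source is unclear).
FORMANTS = {
    "beat": (300, 2150),
    "bit": (430, 1710),
    "bet": (530, 1700),
    "bat": (670, 1630),
    "father": (690, 1200),
    "but": (600, 1300),
    "foot": (450, 1290),
    "boot": (290, 1230),
}

# Front/back pairs with similar first formants, as grouped in the text.
PAIRS = [("beat", "boot"), ("bat", "father"), ("bit", "foot"), ("bet", "but")]


def formant_gaps(a, b):
    """Absolute F1 and F2 differences (cps) between two vowels."""
    f1a, f2a = FORMANTS[a]
    f1b, f2b = FORMANTS[b]
    return abs(f1a - f1b), abs(f2a - f2b)


gaps = {pair: formant_gaps(*pair) for pair in PAIRS}
```

For every pair the first formants differ by 70 cps or less while the second formants differ by 400 cps or more, which is why a loss of second-formant information would be expected to confuse vowels within a pair.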
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Table  2.4-11 

List of Confusion Matrices
for Each of Six Speech-Processing Systems

 1. Voiced-voiceless: Tests 1 and 4
 2. Interrupted-continuant (voiceless): Test 2
 3. Interrupted-continuant (voiced): Test 3
 4. Voiced stop-fricative-nasal manner: Test 5
 5. Nasal-liquid-glide manner (initial): Test 7
 6. Nasal-liquid manner (final): Test 7
 7. Voiceless fricatives /f s ʃ/: Tests 2 and 4
 8. Voiceless fricatives /f θ s ʃ/: Test 6
 9. Voiceless stops /p t k/: Tests 1 and 2
10. Voiced fricatives /v z ʒ/: Tests 3 and 4
11. Voiced fricatives /v z/: Test 5
12. Voiced stops /b d g/: Tests 1 and 3
13. Voiced stops /b d/: Test 5
14. Nasals /m n/ (initial and final): Tests 5 and 7
15. Nasals /m n ŋ/ (final): Test 7
16. Liquids /r l/ (initial): Test 7
17. Liquids /r l/ (final): Test 7
18. Glides /w j/: Test 7
19. Consonant clusters (initial): Test 9
20. Consonant clusters (final): Test 8
21. Vowels /i ɪ ɛ æ/: Test 10
22. Vowels /ɑ ʌ ʊ u/: Test 11
23. Vowels /i æ ɑ u/: Test 12
24. Vowels /ɪ ɛ ʌ ʊ/: Test 13
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Table  2.4-12 

Probability (times 100) That a Stimulus Is Categorized
Incorrectly for Each of the 24 Types of Confusion Listed
in Table 2.4-11. The Letters at the Top of Each Column
Identify the Various Speech-Processing Systems
(See Key in Section 2.1)

           R       N       S       P       H       M      T*

  1.     3.0     2.4     1.4     3.6     6.9    11.1     3.0
  2.     7.7     2.8     6.7     8.0    19.4    21.0     2.9
  3.     6.4     5.7     5.0     9.8    24.6    31.4     6.6
  4.     7.4    11.1    10.0    12.2    25.4    46.2
  5.     4.2    33.8    12.2    14.9    27.3    42.2     6.8
  6.       0     9.3     0.9     1.2     1.2    16.8     0.8
  7.    21.9    28.3    17.4     5.9    11.8    12.9    11.2
  8.    31.0    43.7    37.4    25.1    33.1    30.9    28.2
  9.    14.8    39.8    15.0    13.8    30.1    49.0     9.9
 10.    11.6    26.0    11.7     3.4    14.6    31.6     9.2
 11.     8.5    35.0    14.4     3.0    10.6    17.5    19.5
 12.    12.9    31.5    17.3     5.5    33.0    60.6     8.2
 13.     3.6    16.8     9.9     5.8    22.0    44.6
 14.     5.5    34.5    12.8    19.9    23.5    45.5
 15.    20.0    51.0    25.6    40.1    39.2    63.2    13.0
 16.     1.9    20.7     9.7     8.7     6.7    50.0     2.0
 17.     0.6    25.0     0.8     1.7    11.1    35.2       0
 18.     6.3     5.3     0.6     0.4     1.5    41.1       0
 19.     8.3    23.6    27.2    18.7    24.0    49.4    14.0
 20.     4.9    23.4    17.1    13.6    30.5    35.8     7.6
 21.     2.1     2.6       0     3.1     8.5    15.3
 22.     2.9    22.0     2.3     4.2    16.1     3.7
 23.     7.6    28.0    10.2    14.2    17.4    44.1
 24.     4.2    33.3     9.8    15.0    24.8    40.2

* Data for the Tasaroff-Daguet system are derived from a
restricted series of tests.
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Table  2.4-13 

Probability (times 100) of Incorrect Identification
of Different Consonant Features, as Derived from the
Results of the Nonsense Syllable Tests.
Results Represent Averages of Data in Rows of Table 2.4-12,
as Indicated

                   Rows of
Feature            Table 2.4-12      R      N      S      P      H      M     T*

Voiced-voiceless   1               3.0    2.4    1.4    3.6    6.9   11.1    3.0
Interrupted-
  continuant       2, 3            7.0    4.3    5.8    8.9   22.0   26.2    4.8
Other manner       4, 5, 6         3.9   18.1    7.7    9.4   18.0   35.1
Voiced stops
  + fricatives     10, 12         12.2   28.8   14.5    4.5   23.8   46.1    8.7
Voiceless stops
  + fricatives     7, 9           18.3   34.0   16.2    9.8   21.0   31.0   10.6
Stops              9, 12          13.8   35.6   16.2    9.6   31.6   54.8    9.1
Fricatives         7, 10          16.8   27.2   14.5    4.6   13.2   22.3   10.2
Nasals             14, 15         12.7   42.8   19.2   30.0   31.4   54.4
Liquids
  + glides         16, 17, 18      2.9   17.0    3.7    3.6    6.4   42.1    0.7
Clusters           19, 20          6.6   23.5   22.2   16.1   27.2   42.6   10.8

* Data for the Tasaroff-Daguet system are derived from a
restricted series of tests.
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Table  2.4-14 

Values of the Frequencies F1 and F2 of the First
and Second Formants for Vowels Occurring in Nonsense
Syllables of the Type Used in the Vowel Tests. Averages
Are Taken for the Two Talkers Who Recorded the Tests

Vowel        F1 (cps)     F2 (cps)

i              300          2150
ɪ              430          1710
ɛ              530          1700
æ              670          1630
ɑ              690          1200
ʌ              600          1300
ʊ              450          1290
u              290          1230

2.5  Voice  Quality  Tests

High intelligibility scores, obtained for a particular speech
compression system under a given set of conditions, do not
necessarily imply that the speech output is natural and that the voice
quality is high. A particular system may process speech in such
a way as to obscure many of the familiar perceptual cues and
introduce a new set of consistent cues which, when learned, will
help the listener to distinguish between speech sounds that might
have been ambiguous before learning. Listeners who are familiar
with the system may achieve surprisingly high intelligibility scores
and yet rate the same system as inferior to another equally
intelligible system on the basis of naturalness or voice quality.

Three voice quality tests of the paired-comparison type, one for
each of three speakers, were recorded and administered to two groups
of fifteen students each at Tufts University. These tests consisted
of 42 randomized sentence* pairs representing all possible forward
and reverse combinations of the seven systems under consideration.
Identical sentences were repeated as seldom as possible (although
some repetition was necessary because each speaker read only five
sentences through each system), and the test material and speakers
were not used again for other tests.

* The sentences were the so-called Harvard Test Sentences.¹¹

Two further voice quality tests of the paired-comparison type were
made and given to the students. These tests also consisted of 42
randomized sentence pairs, but the sentence used was always the
same phrase: "Number thirty-six, you will write now" (no PB word
was included). The test material was obtained by editing
appropriate PB word list recordings.
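The figure of 42 pairs per test follows from taking every ordered pair of the seven systems. The sketch below enumerates the pairs and derives the number of response opportunities for each group of tests; the system names are those listed in Tables 2.5-1 and 2.5-2, while the function names and the seeded shuffle are illustrative assumptions.

```python
import random
from itertools import permutations

SYSTEMS = [
    "Melpar", "Stromberg", "Philco", "Hughes",
    "Tasaroff-Daguet", "Narrow-Band", "Reference",
]
SUBJECTS = 30  # two groups of fifteen students


def ordered_pairs(systems):
    """All forward and reverse pairings of distinct systems."""
    return list(permutations(systems, 2))


def presentation_order(systems, seed=0):
    """Pairs in a reproducible random order (the seed is an assumption)."""
    rng = random.Random(seed)
    pairs = ordered_pairs(systems)
    rng.shuffle(pairs)
    return pairs


pairs = ordered_pairs(SYSTEMS)
opportunities_group1 = len(pairs) * SUBJECTS * 3  # three speakers
opportunities_group2 = len(pairs) * SUBJECTS * 2  # two speakers
```

Seven systems give 7 x 6 = 42 ordered pairs, and hence 3780 opportunities to respond in the first group of tests and 2520 in the second, the totals reported with the results.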

The listeners were given the following sets of instructions:

For the first group of three tests: "Each test item consists of
two sentences. During the pause following each pair you are to write
next to the appropriate item on your answer sheet the number 1 if
you think the voice quality of the first sentence of the pair is
better, and the number 2 if you think the voice quality of the second
sentence is better. Make your judgments on the intelligibility and
naturalness of the speech. There are 42 items in each test."

For the second group of two tests: "Each test item consists of
the sentence: 'Number thirty-six, you will write now,' spoken twice.
During the pause following each pair you are to write next to the
appropriate item on your answer sheet the number 1 if you think the
voice quality of the first sentence of the pair is better, and the
number 2 if you think the voice quality of the second sentence is
better. Make your judgments on the intelligibility and naturalness
of the speech. There are 42 items in each test."

The final results obtained for the first group of tests are given
in Table 2.5-1. Each entry represents the number of times that a
particular system was preferred to all other systems by 30 subjects
for three tests (speakers). A distinction is made as to whether the
sentence from the preferred system occurred first or second in each
sentence pair. From an examination of the corresponding columns it
appears that the responses have essentially no time error; i.e.,
there is no obvious tendency for preference of the second sentence
in the pairs regardless of the system used.

Table 2.5-1

Results of Voice Quality Tests Using Different Sentences,
for Three Speakers and Thirty Subjects

                    Preferred Sentence   Preferred Sentence   Total Relative
System              First in Pair        Second in Pair       Preference

Melpar                     57                   22                  79
Stromberg                 372                  436                 808
Philco                    291                  253                 544
Hughes                    201                  206                 407
Tasaroff-Daguet           299                  290                 589
Narrow-Band               236                  217                 453
Reference                 421                  472                 893

No. of blanks                                                        7
No. of opportunities to respond                                   3780

The relative preference for the seven systems is also shown in
graphical form in Fig. 2.5-1. This figure was prepared from the
right-hand column of Table 2.5-1.

The final results obtained for the second group of two tests (using
the phrase "Number thirty-six, you will write now") are given in
Table 2.5-2. Again each entry represents the number of times that
a particular system was preferred to all other systems by 30
subjects for two tests (speakers). In this case there is a slight
indication of time error in the responses.

The relative preference for the systems on the basis of voice
quality tests using the PB carrier phrase is shown in graphical
form in Fig. 2.5-2. This figure was prepared from the right-hand
column of Table 2.5-2.

It will be noted that, except for a change between the orders of
the scores for the Philco and Tasaroff-Daguet systems, the various
systems tested are rank ordered in the same way for both tests of
voice quality.
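The bookkeeping behind Tables 2.5-1 and 2.5-2 can be verified with a short sketch: the preference totals plus the blanks should account for every opportunity to respond, the two rank orders should differ only in the Philco and Tasaroff-Daguet positions, and the share of preferences going to the second sentence indicates the size of any time error. The counts are transcribed from the two tables; the function names are illustrative assumptions.

```python
# (preferred when first in pair, preferred when second in pair)
GROUP1 = {  # Table 2.5-1: different sentences, three speakers
    "Melpar": (57, 22), "Stromberg": (372, 436), "Philco": (291, 253),
    "Hughes": (201, 206), "Tasaroff-Daguet": (299, 290),
    "Narrow-Band": (236, 217), "Reference": (421, 472),
}
GROUP2 = {  # Table 2.5-2: PB carrier phrase, two speakers
    "Melpar": (19, 24), "Stromberg": (203, 260), "Philco": (175, 225),
    "Hughes": (100, 117), "Tasaroff-Daguet": (173, 215),
    "Narrow-Band": (165, 199), "Reference": (303, 341),
}


def totals(table):
    """Total relative preference for each system."""
    return {name: first + second for name, (first, second) in table.items()}


def ranking(table):
    """Systems ordered from most to least preferred."""
    t = totals(table)
    return sorted(t, key=t.get, reverse=True)


def second_position_share(table):
    """Fraction of all preferences given to the second sentence of a pair."""
    first = sum(f for f, _ in table.values())
    second = sum(s for _, s in table.values())
    return second / (first + second)


rank1, rank2 = ranking(GROUP1), ranking(GROUP2)
share1 = second_position_share(GROUP1)
share2 = second_position_share(GROUP2)
```

On these counts the second sentence receives about 50 percent of the preferences in the first group of tests and about 55 percent in the second, consistent with essentially no time error in the former and a slight one in the latter.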


Table 2.5-2

Results of Voice Quality Tests Using the Same Sentence
(PB Carrier Phrase), for Two Speakers and Thirty Subjects

                    Preferred Sentence   Preferred Sentence   Total Relative
System              First in Pair        Second in Pair       Preference

Melpar                     19                   24                  43
Stromberg                 203                  260                 463
Philco                    175                  225                 400
Hughes                    100                  117                 217
Tasaroff-Daguet           173                  215                 388
Narrow-Band               165                  199                 364
Reference                 303                  341                 644

No. of blanks                                                        1
No. of opportunities to respond                                   2520

FIG. 2.5-2 RESULTS OF QUALITY TESTS
WITH SAME SENTENCE (PB
CARRIER PHRASE)

2.6  Talker  Identification  Tests

One particular measurement of the quality of the output of a system
involves the ability of listeners to recognize specific talkers. A
high quality system will leave intact those features of the speech
wave which carry information about the talker, such as pitch
variations, articulatory characteristics, and stress patterns.

The present talker identification tests may be conveniently divided
into two groups. The first group of 14 tests was recorded with two
quartets of speakers, so that each quartet could be tested over each
of the seven systems. All tests commence with each of the four
talkers identifying himself by number and reading two training
sentences. The test proper consists of 20 randomized test sentences
that constitute 20 items. Following a suitable pause after each item,
during which the subjects record their responses, the previous talker
identifies himself. This format ensures that learning continues
throughout the course of the test.

The listeners were provided with the following set of instructions:

"Before the test begins, you will hear four talkers read two
training sentences each to familiarize you with their voices. Each
talker will identify himself by a number. A test item will consist of
a sentence read by one of the four talkers. During the pause which
follows each sentence you are to write the number of the talker you
believe spoke next to the appropriate item on your answer sheet. At
the end of the pause the actual talker will identify himself. If
your answer was correct, make a check to the right of it. If your
answer was incorrect, do not make any mark but wait for the next item.
There are 20 items in the test."

This group of tests was administered to a crew of 30 listeners at
Tufts University.

The second group of eight talker identification tests (four tests
were recorded by each of two quartets) was administered to only 15
listeners. These tests were master recorded at Melpar, Inc., using
their microphone facilities. While recordings were being made from
the input to the Melpar system, the system output was simultaneously
recorded for two tests. Later, two further tests that were recorded
from the input to the Melpar system were played back through the
same system, and the output was again recorded. Dubbings of four
additional tests, originally recorded from the input to the Melpar
system, were also played back through the Hughes and Philco channel
vocoders. The tests in this second group have the same structure
as those in the first group, except that each of the four talkers
read five (instead of two) training sentences before the test items
began.

One major difference between the talker identification tests recorded
at our laboratories and those recorded at Melpar, Inc. is that for
our recordings the microphone was suspended in the center of a circle
of four talkers, about 30 inches from each talker's lips, whereas
the microphone in the Melpar recordings was passed among the talkers
and held 1 to 2 inches from the lips.

Although the tests are of the self-scoring type, most subjects were
unable to correct their tests properly as they took them. Frequently
the subjects did not give themselves credit where it was due, and
about equally often they gave themselves credit where their response
was incorrect. A likely explanation is that the talker's spoken
identification of the correct answer, having itself been processed by
the system under test, is not always clearly intelligible to the
listener.

The mean scores (correct identification in percent) for the two
groups of tests are given in Tables 2.6-1 and 2.6-2. These scores
were obtained by correcting the subjects' "self-scored" answer
sheets with the aid of the appropriate lists that were used during
the recording sessions.


Table 2.6-1

Mean Scores (Correct Talker Identification in Percent)
for 30 Subjects
Tests Master Recorded at BBN

System             Quartet I    Quartet II    Average for
                                              Both Quartets

Melpar                27            37             32
Stromberg             41            54             48
Philco                43            35             39
Hughes                33            31             32
Tasaroff-Daguet       45            44*            45
Narrow-Band           45            48             47
Reference             54            59             57

*Quartet III

Table 2.6-2

Mean Scores (Correct Talker Identification in Percent)
for 15 Subjects
Tests Master Recorded at Melpar, Inc.

System and         Quartet IV   Quartet V     Average for
Condition                                     Both Quartets

Melpar (live)         43            48             46
Melpar                31            31             31
Philco                47            55             51
Hughes                39            41             40

An analysis of variance of the test score distributions was undertaken
for the first group of tests (see Tables 5.2-5 and 5.2-7). The
data obtained for each quartet were then examined to determine which
mean score differences are statistically significant (see Tables
5.2-6 and 5.2-8). For both quartets, the order of some systems that
fell into groups with insignificant differences between mean scores
has been modified in order to arrive at a rank order that has
general validity. This rank order is shown in Table 2.6-3, together
with the significant differences between system scores for
quartets I and II.


Table 2.6-3

Rank Order of Systems, from Best to Worst,
According to Significant Differences between Scores
Obtained for Quartets I and II. Tests Master Recorded at BBN

Significant*                               Significant*
Difference          System                 Difference
Quartet I                                  Quartet II

                    Reference
   Yes                                        No
                    Stromberg
   No                                         Yes
                    Narrow-Band
   No                                         No
                    Tasaroff-Daguet
   No                                         Yes
                    Philco
   Yes                                        No
                    Hughes
   Yes                                        No
                    Melpar

Each Yes/No entry refers to the difference between the two systems
listed immediately above and below it.

(*Statistically significant at the 0.05 level of confidence.)

The results obtained for the second group of tests, master recorded
at Melpar, Inc., indicate that the use of the microphone for which
the Melpar system was designed does not improve the low rating of
this system. The scores for the Hughes and Philco systems are,
however, improved somewhat by using the Melpar microphone facilities.
Talker identification via the Melpar system is significantly improved
when the system is tested live, i.e., when the recording microphone
is substituted for a pre-recorded tape.

2.7  Comprehension  of  Continuous  Speech  in  Noise 

The context in which an unintelligible word occurs is often important
because it may help the listener to resolve his doubts about the
ambiguous word. A test in which listeners are asked whether they can
make out the "essence" of a message may therefore result in much
higher scores than a test in which the intelligibility of isolated
words is measured. Tests dealing with the comprehension of continuous
speech are different from intelligibility and quality tests;
they must be considered in a separate category.

The relative comprehension of continuous speech was measured for
various signal-to-noise ratio conditions at the inputs of the seven
speech compression systems considered. One test consisting of 49
samples of continuous speech (seven systems at each of seven
signal-to-noise ratios), each set against a constant background
of noise, was administered to a crew of 35 listeners. Each test
sample had a duration of 25 seconds and a signal-to-noise ratio that
was either 30, 25, 20, 15, 10, 5 or 0 db. The subjects were instructed
to mark the items on their answer sheets with an A if they could "make
out almost every word," with a B if they could "make out only a few
words," and with a C if they could "make out almost no word at all."

The speech material, which was recorded by a male talker under
quiet conditions, was selected from: A. Smith, The Wealth
of Nations.* An Altec-Lansing Model 661A dynamic microphone was
positioned 10 inches from the talker's lips. The noise had a
spectrum similar to that of the long-time average of speech and
was electrically added to the recorded speech signal.

The results for this test are shown in Table 2.7-1, and in graphical
form in Fig. 2.7-1. For a given system and a particular
signal-to-noise ratio, the score A was weighted 2 points, the
score B, 1 point, and the score C, zero. The ordinate scale in
Fig. 2.7-1 has been arbitrarily selected so that the reference
system scores 100% (relative comprehension) for a signal-to-noise
ratio of 30 db.
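
The scoring rule just described can be sketched in a few lines of
Python. This is only an illustration of the arithmetic, not the
procedure actually used by the authors; the function names and the
example split of ratings are hypothetical. The one value taken from
the report is the reference score at 30 db, where all 35 subjects
rated the sample "A".

```python
# Point weights from the text: A = 2, B = 1, C = 0.
WEIGHTS = {"A": 2, "B": 1, "C": 0}


def cumulative_score(ratings):
    """Sum the weighted points over all listeners' ratings of one sample."""
    return sum(WEIGHTS[r] for r in ratings)


def relative_comprehension(score, reference_score):
    """Express a cumulative score as a percentage of the reference score."""
    return 100.0 * score / reference_score


# Reference system at 30 db: all 35 subjects rated the sample "A".
reference_30db = cumulative_score(["A"] * 35)

# Hypothetical split of 35 ratings, for illustration only (not report data).
example = cumulative_score(["A"] * 20 + ["B"] * 10 + ["C"] * 5)

print(reference_30db)                                             # 70
print(round(relative_comprehension(example, reference_30db), 1))  # 71.4
```

With 35 listeners the largest possible cumulative score is 70, which
is why the ordinate of the figure can be normalized to that value.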

Although the results vary somewhat at different signal-to-noise
ratios, it appears that the Hughes, Philco, Tasaroff-Daguet and
Narrow-Band systems perform, on the average, about equally well
on this test. The Reference system and the Stromberg semi-vocoder
are much less adversely affected by noise than are the other systems,
and the Melpar formant vocoder performs very poorly in noise. One
possible explanation for the relatively poor performance of the
formant vocoder is that the talker reads at a rate that is too high
for proper operation of the pitch extractor and formant-tracking
circuits in the device. The Melpar vocoder apparently has considerable
difficulty in tracking the fundamental and formant frequencies for
speech spoken at a fast rate. It will be recalled that for PB word
tests, where each test word is spoken individually, intelligibility
was still appreciable in a 15 db signal-to-noise ratio condition,
whereas relative comprehension of this continuous discourse is
essentially zero under similar noise conditions.


*  E.  P.  Dutton  and  Co.,  Inc.,  New  York,  1937 

Table 2.7-1

Results of Relative Comprehension Test
Each Entry Represents a Cumulative Point Score for 35 Subjects

                          Signal-to-Noise Ratio (db)
System              30    25    20    15    10     5     0

Melpar              33    18     7     0     0     0     0
Stromberg           70    69    67    68    41   2[?]    0
Philco              68    70    48    29    18     0     0
Hughes              62, 60, 63, 11, 0, 0  [one entry illegible; position unknown]
Tasaroff-Daguet     65    52    65    31    23     5     1
Narrow-Band         65, 61, 42, 19, 1, 1  [one entry illegible; position unknown]
Reference           70*   70    70    66    55    23     3

*All 35 subjects rated this sample "A" (2 points).

FIG. 2.7-1  FINAL RESULTS OF RELATIVE
COMPREHENSION TEST, BASED
ON DATA GIVEN IN TABLE 2.7-1

Abscissa: signal-to-noise ratio at system input
Key: M - Melpar; S - Stromberg; P - Philco; H - Hughes;
     T - Tasaroff; N - Narrow Band; R - Reference

3.  EVALUATION  OF  SPEECH  COMPRESSION  SYSTEMS  TESTED

3.1  Reference  (Low-Pass)  System

The reference system consists of a low-pass filter with a cut-off
frequency of 1500 cps. The slope of the filter characteristic
above this frequency is 36 db/octave, and thus the gain is down
about 18 db at 2100 cps and 36 db at 3000 cps.
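
The two attenuation figures follow from the stated slope if the
filter skirt is idealized as a straight line on a logarithmic
frequency scale. The short Python check below makes that assumption
explicit; the function name is hypothetical and the real filter
response near the cut-off would of course be smoother.

```python
import math


def attenuation_db(freq_cps, cutoff_cps=1500.0, slope_db_per_octave=36.0):
    """Idealized straight-line attenuation of the low-pass skirt.

    Assumes 0 db up to the cut-off and a constant slope above it.
    """
    if freq_cps <= cutoff_cps:
        return 0.0
    octaves_above_cutoff = math.log2(freq_cps / cutoff_cps)
    return slope_db_per_octave * octaves_above_cutoff


for f in (2100, 3000):
    print(f, round(attenuation_db(f), 1))
# 2100 17.5
# 3000 36.0
```

Since 2100 cps is a little under half an octave above 1500 cps, the
idealized loss there is about 17.5 db, in agreement with the rounded
figure of 18 db quoted in the text.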

The types of errors in identification of vowels and consonants
in nonsense syllables for the low-pass system are those that would
be expected in view of the lack of high-frequency data in the speech
signal. There are a few errors in voicing and manner of production,
especially for fricative and stop consonants, since the
high-frequency energy for such consonants apparently has some cue
value for these distinctions. Errors in manner of production for
nasals, liquids and glides are relatively small, since these voiced
sounds have little high-frequency energy.

The identification of place of consonant production for the
low-pass system is relatively good for consonants with appreciable
low-frequency energy, namely the nasals, liquids and glides. For
voiceless fricatives, on the other hand, the number of errors is
large (22 percent for /f s ʃ/), since distinctions among these
consonants are apparently made primarily on the basis of
high-frequency spectrum shape. The errors in place for voiceless stops
and for voiced fricatives and stops are in the range 12-15 percent,
i.e., somewhat less than for voiceless fricatives; cues for these
consonants are carried in part by low frequencies, including the
vowel transitions, and in part by high frequencies. As for the
glides, /j/ is frequently identified as /w/, since the
high-frequency energy concentration in /j/ is not passed by the filter.

For vowels, the number of errors is considerably smaller than for
any of the other systems. Such a result is to be expected, since
the important cues for vowels (the frequency locations of the
first two formants) are known to lie below 2500 cps.

In general, most of the error scores for nonsense syllables
processed by the reference system were comparable to or lower than
those for all other systems. The scores for this system were
significantly poorer than those for other systems only for voiceless
fricatives. The tests other than nonsense syllables also demonstrate
the superior performance of the reference system relative
to all other systems. This superiority seems especially clear-cut
for the voice quality tests (Figs. 2.5-1 and 2.5-2) and for the
talker identification tests summarized in Table 2.6-3. (The scores
for PB tests with the reference system do not greatly exceed those
for some of the other systems if PB scores for the reference system
are adjusted downward by 5 to 10 percentage points to correct
for learning effects.) Such a result is reasonable if it is
hypothesized that voice quality and ability to identify talkers are
preserved if the temporal properties of the signal, particularly
the detailed properties of the quasi-periodic voice source, are not
distorted. The validity of this hypothesis will be examined further
as data from other systems are discussed.

3.2  Channel  Vocoder

The vocoder technique is based on a model that views the generation
of speech as the excitation of the vocal cavities by either a noise
source or a quasi-periodic buzz source. This generation process
is simulated in the channel vocoder by using electrical versions
of one or another of such sources as excitation for a bank of
filters and by varying the gain of each filter to provide an
appropriate spectral output as a function of time. If a maximum amount
of bandwidth compression is to be achieved, the number of filters
must not be too large, and the rate of change of the signals that
specify the gains of the individual filters must be limited. Thus,
any vocoder design must represent a compromise between the quality
and intelligibility of the speech output on one hand and the channel
capacity on the other.

The bandwidth and information rate required for a channel vocoder
have been estimated previously. The bandwidth of the transmitted
signal corresponding to each filter channel is considered to be
about 25 cps, and the bandwidth required for the signal that
indicates the fundamental frequency is apparently also about 25 cps.
For a 16-channel vocoder, therefore, the total bandwidth is about
425 cps. In the case of a digitized vocoder, sampling rates of
about 45 samples/sec for each channel, with 3-bit amplitude
specification, seem adequate for the channel signals, whereas 6 bits
are necessary for the "pitch" signal. Thus the total information
rate is about 2400 bits/sec.
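
The arithmetic behind the 425 cps and 2400 bits/sec estimates can be
reproduced with the short Python sketch below. The function names are
hypothetical, and the sketch assumes that the "pitch" signal is
sampled at the same 45 samples/sec as the channel signals, which the
text implies but does not state outright.

```python
def analog_bandwidth_cps(n_channels, channel_bw_cps=25.0, pitch_bw_cps=25.0):
    """Total analog bandwidth: one slot per filter channel plus the pitch signal."""
    return n_channels * channel_bw_cps + pitch_bw_cps


def digital_rate_bits_per_sec(n_channels, samples_per_sec=45.0,
                              channel_bits=3, pitch_bits=6):
    """Information rate, assuming pitch is sampled at the channel rate."""
    bits_per_frame = n_channels * channel_bits + pitch_bits
    return samples_per_sec * bits_per_frame


print(analog_bandwidth_cps(16))       # 425.0
print(digital_rate_bits_per_sec(16))  # 2430.0
```

Under these assumptions each frame carries 16 x 3 + 6 = 54 bits, and
45 frames/sec gives 2430 bits/sec, which rounds to the figure of
about 2400 bits/sec.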

The following discussion of channel vocoder performance is
based on the results obtained for the Philco 16-channel vocoder.
The filters in this vocoder are uniformly spaced up to 1000 cps,
with bandwidths of 133 cps, and are logarithmically spaced above
1000 cps to an upper frequency limit of 3800 cps. [It was evident
in the course of the testing program that the 12-channel digitized
Hughes Vocoder had a malfunction, and thus the results for that
system were not considered to be representative of the performance
of channel vocoders in general. Several aspects of the test data
provide evidence that the Hughes Vocoder was not operating in an
optimum way. For example, the average PB word score in quiet for
two male talkers was only 61 percent (Fig. 2.3-2), whereas for a
signal-to-noise ratio of 15 db the average score (for one talker)
was much higher, 73 percent (Fig. 2.3-^). Informal tests of the
Hughes system for analog operation showed greatly improved
performance compared with digital operation for talkers in quiet, of
the order of 15 to 20 percent improvement for PB words. Apparently,
therefore, the malfunction was in the digital circuits.]

Some inherent limitations in present channel vocoder techniques
are evident from the data for the nonsense syllable tests. Consider
first the voiced-voiceless distinction, for which the probability
of error is 3.6 percent for the Philco Channel Vocoder.
It is suggested that the necessity of making this binary decision
in the vocoder analyzer introduces an error that is difficult to
reduce much below this value without appreciable increase in
complexity. Presumably part of the voiced-voiceless error is due to
the limitation that buzz and noise sources cannot exist simultaneously
in the vocoder, whereas both sources may, of course, exist for
certain speech sounds. Part of the error is probably also attributable
to the difficulty of devising a pitch extractor that operates
reliably and with negligible delay. A delay of as little as 10 to
20 msec in operation of the buzz source could, for example, easily
result in a voiced fricative or stop being called voiceless. In
contrast to the conventional channel vocoder are the semi-vocoder,
the reference system, the Tasaroff-Daguet system, and the spectrum
sampling system, none of which requires a buzz-hiss decision or pitch
extraction. These systems all have fewer errors in the voiced-voiceless
distinction than does the channel vocoder.

The feature interrupted-continuant is detected incorrectly about
8.9 percent of the time for the channel vocoder. Errors in this
feature may arise because the modulators controlling the individual
filter outputs cannot change rapidly, owing to limitations in the
bandwidth and sampling rate of the transmission signals. Note, for
example, that voiced stops are frequently called voiced fricatives,
since the rapid changes inherent in the stops cannot, apparently,
be reproduced in the vocoder. The limited dynamic range of the
system apparently also contributes to errors in the
interrupted-continuant feature; weak sounds such as /f/ are often not
reproduced at all, with the result that they are heard as stops.
Arguments similar to these could be given to indicate probable causes
of error in judging manner of production between nasals, liquids and
glides (about 8 percent probability of error). To some extent the
grossness of the reproduced frequency spectrum, as well as
inadequacies in reproduction of temporal variations of the spectrum,
could contribute to errors in this decision.

Errors in judging place of production of many of the consonants
and vowels processed by the channel vocoder are attributable largely
to the fact that the acoustic spectrum is analyzed and reproduced
only grossly by the filter banks. For voiceless and voiced
fricatives, these errors seem relatively small (5-9 percent and 3.2
percent for these tests), indicating that the gross spectral features
as reproduced by the channel vocoder are adequate for these classes
of sounds.

In cases where important cues are contributed by rapidly changing
spectral features, however, the performance of the channel vocoder
is less satisfactory. Examples are the /f/ - /θ/ distinction (37
percent errors), which seems to depend on formant transitions, the
voiceless stops (14 percent), the voiced stops (17 percent), and
the nasals (18 percent errors for initial and 40 percent for final).
For these groups of sounds, both the grossness of the spectral
representation and inadequate reproduction of rapid temporal changes
contribute to the errors.

As for the vowels, relatively few errors are made in the tests
involving the front vowel sequence /i ɪ ɛ æ/ and the back vowel
sequence /u ʊ ɔ ɑ/. Vowels within these groups differ in first
formant frequency, and hence have quite different over-all spectrum
shapes. The spectra are apparently reproduced with sufficient
accuracy to permit discrimination among these vowels. For the long
vowels /i æ ɑ u/ and short vowels /ɪ ɛ ʌ ʊ/, however, the number
of errors is much greater (14 and 15 percent, respectively).
Most of the errors are made within pairs of vowels that have roughly
the same first formant frequencies and differ only in the second
formant frequencies. Examples of such pairs are /æ ɑ/, /ɛ ʌ/ and
/ɪ ʊ/.

For the Philco vocoder, the PB word intelligibility in quiet for
male talkers was 85 percent, and thus was comparable to that of
the semi-vocoder. As noted above, there is some loss in
intelligibility in the channel vocoder relative to the semi-vocoder due
to errors in reproducing excitation characteristics. However, the
filter channels for the Philco channel vocoder extend to a higher
frequency than those for the Stromberg semi-vocoder, and hence the
errors for certain sounds with high-frequency energy are greater
for this semi-vocoder than for the channel vocoder. Furthermore,
the method of extracting an excitation signal in the semi-vocoder
may not yield a sufficiently broad-band signal, and thus there may
be a lack of high frequencies in the speech synthesized at the
receiver. The result is that the overall intelligibilities for the
two systems are comparable.

The PB scores for female talkers are much lower than those for
male talkers (59 versus 85 percent). This deterioration is probably
attributable, at least in part, to poor performance of the pitch
extractor for female voices, for which the fundamental frequency is
high.

In the quality tests and in the talker identification tests, the
performance of the semi-vocoder was significantly superior to that
of the channel vocoder, as shown in Figs. 2.5-1, 2.5-2, and Table
2.6-3. These data provide further evidence, therefore, that good
voice quality and correct talker identification depend on
maintaining an accurate replica of the temporal properties of the voice
excitation. For any system in which the fundamental frequency must
be extracted and the source must be reconstituted at the receiver,
the voice quality deteriorates.

3.3  Semi-Vocoder 

The principle of operation of the semi-vocoder is similar to that
of the channel vocoder, with the exception that no pitch extractor
or voice-hiss detector is required in the former [47]. In the
particular version tested, a baseband covering the frequency range 250 to
750 cps is transmitted directly, and by various distortion means
at the receiver this signal is used to form a relatively
flat-spectrum excitation signal for a set of 13 conventional vocoder
channels in the frequency range 750 to 3250 cps. The baseband is
also reproduced directly at the receiving end of the link and is
mixed with the signal synthesized by the filters.

The analog semi-vocoder that was tested required a bandwidth of
about 900 cps for transmission of both baseband and vocoded
signals, including guard bands. (The system was actually designed
in such a way that three sets of semi-vocoder signals, suitably
multiplexed, could be transmitted via a conventional telephone
link.) The information rate required for a digitized version of
such a semi-vocoder has been estimated to be about 6500 bits per
sec.

As noted in the discussion of the channel vocoder, the number of
errors in the nonsense syllable tests for the voiced-voiceless
distinction (1.4 percent) is much smaller for the semi-vocoder
than for the conventional channel vocoder. The buzz-hiss
distinction is made automatically in the semi-vocoder analyzer, and no
decision process is required in the equipment. Likewise, the
features involving manner of production are identified with a
somewhat better score for the semi-vocoder than for the channel
vocoder, indicating the improvement associated with direct
transmission of the baseband.

With regard to identification of place of production, the errors
for voiceless consonants are relatively high, since the upper
frequency limit of the semi-vocoder tested was only 3250 cps. Also,
as noted previously, there may be some difficulty in deriving from
the 250-750 cps baseband a suitable noise-excitation signal with
enough high-frequency energy. Direct transmission of the baseband
helps to reduce the errors in identifying place of production
of voiced sounds, particularly the nasals, glides, liquids and
vowels. The greatest number of errors for vowels occurs for cases
where two vowels have roughly the same first-formant frequencies
and differ only in second-formant frequencies, such
as the pairs /i-u/, /æ-ɑ/, /ɪ-ʊ/, and /ɛ-ʌ/.

The baseband in the semi-vocoder tested, i.e., 250-750 cps, was
apparently selected on the assumption that the speech input was
restricted at low frequencies by a carbon microphone and that the
equipment would be used by male talkers. For the high fundamental
frequency used by female talkers, the baseband might contain two
and sometimes only one harmonic of the fundamental. Under such
circumstances, it becomes difficult to devise a distorting circuit
that yields an excitation signal with a flat spectrum envelope,
i.e., with equal amplitude for all harmonics of the fundamental.
The sharp drop in intelligibility of the semi-vocoder output for
female talkers relative to male talkers (56 and 86 percent,
respectively) reflects this limitation.
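
The point about the number of harmonics in the baseband can be made
concrete with a small Python sketch. The fundamental frequencies used
below are illustrative values chosen for this example and are not
taken from the test data; the function name is likewise hypothetical.

```python
def harmonics_in_band(f0_cps, low_cps=250, high_cps=750):
    """Return the harmonics k * f0 that fall inside the baseband."""
    harmonics = []
    k = 1
    while k * f0_cps <= high_cps:
        if k * f0_cps >= low_cps:
            harmonics.append(k * f0_cps)
        k += 1
    return harmonics


# Illustrative fundamentals: one low (male-like) and three higher values.
for f0 in (120, 220, 300, 400):
    print(f0, harmonics_in_band(f0))
# 120 [360, 480, 600, 720]
# 220 [440, 660]
# 300 [300, 600]
# 400 [400]
```

For a low fundamental the baseband holds four harmonics, whereas for
fundamentals of a few hundred cps it holds only two or one, leaving
the distorting circuit little material from which to regenerate a
flat-spectrum excitation.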

When speech was mixed with noise at the semi-vocoder input, the
PB word intelligibility for a 15 db signal-to-noise ratio decreased
from 86 to 70 percent for male voices, while the "relative
comprehension" of continuous speech showed only a small decrease for the
same noise conditions. It appears that the features upon which the
listener bases his judgments in the comprehension test include,
to a large extent, the stress and intonation pattern. These patterns
are preserved reasonably well in the case of the semi-vocoder, even
though some other features may be partially obscured by the noise.

The  results  of  the  quality  and  speaker-identification  tests  also 
reflect  the  degree  to  which  these  patterns  are  preserved  in  the 
semi-vocoder.  The  semi-vocoder  ranks  highest  in  both  types  of 
tests  when  compared  to  the  other  compression  systems. 

3.4  Formant  Vocoder

The formant vocoder represents a modification of the basic channel
vocoder. As in the channel vocoder, a circuit in the analyzer makes
the distinction between buzz and hiss excitation; a signal
proportional to the fundamental frequency is extracted for intervals in
which there is buzz excitation, and this signal is transmitted to
the receiver. Spectral information is described by a relatively
small number of parameters that indicate certain salient spectral
features. During non-nasal vowel or vowel-like sounds, two or three
of these signals are supposed to indicate the frequencies of the
lowest two or three vocal-tract resonances or formants. In some
versions of formant vocoders, the amplitudes of the resonances are
specified as well as their frequencies. When vowels are nasalized,
and at times when the vocal-tract excitation is not at the glottis,
the way in which spectral information is extracted and synthesized
is somewhat different from one version of the formant vocoder to
another.

For the formant vocoder tested in this program, seven parameters
were extracted at the analyzer: amplitudes and frequencies of
formants 1 and 2, location of a high-frequency "fricative formant,"
amplitude of the high-frequency portion of the signal, and fundamental
frequency. The analog bandwidth required for each of these channels
was estimated to be 20 cps, giving a total analog bandwidth of 140
cps. For digital operation, the sampling rate was 43.5 cps, and
all parameters except fundamental frequency were quantized to 3 bits;
5 bits were used to code fundamental frequency. Thus, the total
information rate was 1000 bits/sec.
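
As a check on these figures, the seven parameters and their bit
allocations can be tabulated and summed as in the Python sketch
below. The dictionary keys are descriptive labels invented for this
illustration; the bit counts, sampling rate, and per-channel
bandwidth are the values quoted in the text.

```python
# Bits per sample for each of the seven transmitted parameters.
PARAMETER_BITS = {
    "formant_1_amplitude": 3,
    "formant_1_frequency": 3,
    "formant_2_amplitude": 3,
    "formant_2_frequency": 3,
    "fricative_formant_location": 3,
    "high_frequency_amplitude": 3,
    "fundamental_frequency": 5,
}

SAMPLES_PER_SEC = 43.5      # sampling rate for digital operation
CHANNEL_BANDWIDTH_CPS = 20  # estimated analog bandwidth per parameter

bits_per_frame = sum(PARAMETER_BITS.values())
bit_rate = SAMPLES_PER_SEC * bits_per_frame
analog_bandwidth = CHANNEL_BANDWIDTH_CPS * len(PARAMETER_BITS)

print(bits_per_frame)    # 23
print(bit_rate)          # 1000.5
print(analog_bandwidth)  # 140
```

Six 3-bit parameters plus one 5-bit parameter give 23 bits per
frame, and 23 x 43.5 is 1000.5, which accounts for the round figure
of 1000 bits/sec.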

While the details of the analysis and synthesis procedures may vary
greatly from one type of formant vocoder to another, the data
obtained for the particular version tested in the present program seem
to illustrate some of the limitations inherent in formant vocoders
in general.

We shall examine first the performance for non-nasal vowels and
for the vowel-like sounds /w j l r/, since the formant vocoder
technique is designed to reproduce such sounds in a reasonably
straightforward way. The group of sounds for which the fewest
errors are made is the back vowel series /ɑ ɔ ʊ u/. These vowels
can be said to be distinguished on the basis of the frequency
position of a main concentration of energy in the frequency range 200
to 1200 cps. Apparently the formant trackers detect this energy
concentration adequately, and a correct identification is made
within this group of vowels whether one or two formants are assigned
to this region.

A large number of errors are made, however, for the two vowel
groups /i æ ɑ u/ and /ɪ ɛ ʌ ʊ/, caused, apparently, by errors in
tracking the second formant. In the case of /i/, for example,
two formants often seem to be assigned to the strong energy
concentration at low frequencies. For other vowels there is a tendency
for the first two formants to be called one formant when they are
closely spaced, so that the second formant is then assigned
incorrectly. The identification of the liquids /l r/ is quite poor,
since the particular system tested was inherently unable to track
or to generate the low-frequency third formant that is
characteristic of /r/. The glides /w j/ are extreme examples of cases where
there are two closely spaced formants (the first two for /w/, the
second and third for /j/), and it is clear from the results for these
sounds that the formants are not tracked correctly.

When important cues for identification come from rapid formant
transitions, as in consonant classes such as voiced stops and nasals, the
percentage of errors is higher than for all other classes of
consonants. It would appear that substantial tracking errors are made
when changes in formant frequency occur in a few tens of
milliseconds. For example, 61 percent errors are made in identifying
place of production of /b d g/. Of this group, the response that
is made more often than the others is /g/, which is characterized
by formants that move less rapidly than those for /b/ and /d/.
Likewise, the nasals are identified only slightly above chance
level. Errors in place of production for voiceless fricatives
(13 percent for /f s ʃ/) are not as high as for other classes of
consonants, since a special circuit was incorporated in the
synthesizer to accommodate this class of sounds. Voiceless stops (49
percent error) and voiced fricatives (32 percent error), however,
are not identified as accurately as voiceless fricatives.

The data obtained from the PB word intelligibility tests, the
quality judgments and the talker-identification tests all indicate
that the performance of the formant vocoder was poorer than that
of all other systems tested. This performance is the result of
errors both in reproduction of fundamental frequency and in
reproduction of the spectrum. Furthermore, the tests on the
comprehension of continuous speech in the presence of noise indicate that
noise mixed with the input speech results in a sharp increase in
the errors in tracking the various parameters. Apparently the
method used to track formants in the system tested (a method based
on measurement of average density of zero-crossings in particular
frequency bands) was rather sensitive to noise at the input.
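
The zero-crossing principle mentioned above can be illustrated with
a minimal Python sketch. It is not a description of the tested
analyzer, whose circuit details are not given here. It only shows,
under the assumption of a single dominant sinusoidal component in a
band, how the average density of zero crossings yields a frequency
estimate, and why noise added at the input biases that estimate. The
function name, sampling rate, tone frequency and noise level are all
assumptions made for the illustration.

```python
import math
import random


def zero_crossing_frequency(samples, sample_rate):
    """Estimate frequency (cps) as half the number of sign changes per second."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = (len(samples) - 1) / sample_rate
    return crossings / (2.0 * duration)


SAMPLE_RATE = 8000  # samples per second (assumed)
TONE_CPS = 500      # stand-in for a formant inside one analysis band (assumed)

# One second of a pure tone; the phase offset keeps samples off exact zeros.
clean = [
    math.sin(2 * math.pi * TONE_CPS * n / SAMPLE_RATE + 0.3)
    for n in range(SAMPLE_RATE + 1)
]

# The same tone with additive Gaussian noise (seeded for repeatability).
rng = random.Random(0)
noisy = [x + rng.gauss(0.0, 0.5) for x in clean]

clean_estimate = zero_crossing_frequency(clean, SAMPLE_RATE)
noisy_estimate = zero_crossing_frequency(noisy, SAMPLE_RATE)
print(clean_estimate, noisy_estimate)
```

For the clean tone the estimate equals the tone frequency. The noise
adds spurious sign changes near each true crossing, so the noisy
estimate is pushed upward, which is consistent with the sensitivity
to input noise reported for the tested system.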

3.5  Spectrum  Sampling  (Narrow  Band)  System 

In the spectrum sampling system the speech is passed through
several narrow frequency bands distributed throughout the speech
frequency range, and the resulting signal is transmitted to the
receiver. The hypothesis is that, if the sampled frequency bands
are properly selected, the listener may perform some sort of spectral
interpolation such that the intelligibility is much greater than
that of a single continuous frequency band of the same total width.

In the particular system studied in the present series of tests,
three frequency bands were used as follows: 400-800, 1550-1950 and
3425-3550 cps at points 30 db down from the mid-band gain, giving
a total analog bandwidth of 925 cps by this measure. However, in
the transposition of these three bands into a single continuous
band for transmission, the bands were overlapped slightly, so that
a total continuous bandwidth of only about 800 cps was used. Other
combinations of bands, possibly six bands instead of three, could
have been selected to give greater or less total bandwidth and
different overall intelligibility. At the time the present tests
were conducted a system of only three bands was available.
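
The bandwidth figure above can be checked with a few lines of
Python. The band edges are those quoted in the text; the helper
names are hypothetical and are introduced only for this sketch. The
lookup function is a convenience for asking whether a given spectral
feature falls inside one of the transmitted bands.

```python
# Passbands of the tested system in cps, measured at the 30 db down
# points (values from the text).
BANDS_CPS = [(400, 800), (1550, 1950), (3425, 3550)]


def total_bandwidth(bands):
    """Sum of the individual band widths, in cps."""
    return sum(high - low for low, high in bands)


def band_index(frequency_cps, bands):
    """Return the index of the band containing the frequency, or None."""
    for index, (low, high) in enumerate(bands):
        if low <= frequency_cps <= high:
            return index
    return None


summed_width = total_bandwidth(BANDS_CPS)
print(summed_width)
print(band_index(600, BANDS_CPS), band_index(1200, BANDS_CPS))
```

The summed width is 925 cps. Since the transmitted band was only
about 800 cps wide, the overlap introduced in transposition must
have amounted to roughly 125 cps in total.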

The results of the nonsense syllable tests indicate that, in
comparison with other systems with comparable bandwidth, the band
selection system reproduces the voiced-voiceless distinction and
the manner of articulation reasonably well (2.4 percent errors for
voiced-voiceless, 4.3 percent for interrupted-continuant). As
would be expected, any distinction that depends primarily on temporal
rather than detailed spectral characteristics of the signal should
be received with a fairly small number of errors for this system,
since it imposes no distortion on the temporal characteristics.

The number of errors in identification of place of production for
vowels and consonants is, however, quite high in comparison with
most of the other systems, as Fig. 2.4-3 shows. The types of errors
seem always to follow a pattern that is closely related to the
particular frequency bands used in the system. For example, relatively few
errors are made in identifying /s/ and /ʃ/, whereas /f/ is identified
as /s/ most of the time. The fricative /ʃ/ has a major
concentration of energy around 2000 cps, and apparently this is adequately
reproduced by the 1550-1950 cps frequency band. A spectral energy
maximum around 3500 cps could, on the other hand, lead to an
acceptable /s/. For /f/, the spectrum is usually rather flat in
the frequency range up to 5000 cps. Introduction of an artificial
peak around 3500 cps could easily lead to erroneous identification
of /f/ as /s/. A similar pattern of response is found for the
voiced fricatives /v z ʒ/.

The errors in identification of stop and nasal consonants are
patterned in a way that indicates a high accuracy of identification
of the post-dental consonants /t d n/, with many more errors for
the bilabials and velars. Cues for the identification of these
consonants are known to be carried by formant transitions of the
adjacent vowel, particularly transitions of the second formant. For
post-dental consonants, the locus or target frequency of the second
formant is known to be in the vicinity of 1800 cps. The small
number of errors for post-dental consonants apparently arises,
therefore, from the fact that this locus frequency is in one of the
frequency bands passed by the narrow band system. Performance for the
other consonants in the stop and nasal classes is, however, quite poor,
with the result that the overall errors in place of production for
these consonants are in the range 30 to 40 percent.

The pattern of errors for vowels is likewise explainable in terms
of the frequency ranges of the lower two filters in relation to
the frequencies of the first two vowel formants. In general it can
be said that the formant frequencies for the front vowels /l 1 e as/
lie within the frequency ranges of the lower two filters (/l/ is
slightly outside the range), and the overall error score for these
vowels in all tests was about 8 percent. On the other hand, the
frequencies of the second formants for the back vowels
are always below the frequency range 1550-1950 cps passed by the
system. This is reflected in the overall error score for the
back vowels in all vowel tests, which was about 35 percent.

4.  STATUS OF VARIOUS SPEECH COMPRESSION TECHNIQUES AND
RECOMMENDATIONS FOR FUTURE RESEARCH AND DEVELOPMENT

4.1  Introduction 

To facilitate a discussion of the merits and possible future
potential of presently available speech compression techniques, it
is convenient to group the techniques according to the degree of
compression they achieve. Five groups have been arbitrarily
defined, as shown in Table 4.1-1. Each group includes those
techniques which require an information rate in the indicated range
for digital operation, or a corresponding bandwidth in the
indicated range for analog operation.* The techniques under each
group heading will be discussed individually and evaluated with
respect to their relative strength, their possible future
potential, and their readiness for equipment development. In addition,
recommendations for more research effort on specific systems will
be made wherever applicable. Finally, Section 4.7 will give a
survey of current research that is relevant to the development
of speech compression systems in general.
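
The footnote to this paragraph states the rule used to relate analog
bandwidth to information rate: sampling at about twice the
bandwidth, with three to five bits per sample. The Python sketch
below applies that rule to the bandwidths that separate the groups.
The function name and its default arguments are assumptions made for
the illustration.

```python
def information_rate_range(bandwidth_cps, min_bits=3, max_bits=5):
    """Bit-rate range for Nyquist-rate sampling at 3 to 5 bits per sample."""
    sampling_rate = 2 * bandwidth_cps  # Nyquist rate, samples per second
    return sampling_rate * min_bits, sampling_rate * max_bits


# Bandwidths (cps) and information rates (bits/sec) that separate the
# groups, as given in the headings of Sections 4.2 through 4.6.
GROUP_BOUNDARIES = [(1500, 12000), (600, 5000), (250, 2000), (100, 800)]

for bandwidth, quoted_rate in GROUP_BOUNDARIES:
    low, high = information_rate_range(bandwidth)
    print(bandwidth, low, high, low <= quoted_rate <= high)
```

Each quoted boundary falls inside the range given by the rule, and
corresponds to roughly four bits per sample.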

The relative strength of a particular technique is estimated
partly on the basis of test results obtained from representative
compression systems. Particular attention was given to results
from PB word intelligibility tests and voice quality tests. The
relative complexity of the technique is also considered in this
estimate since the amount of equipment associated with a given
technique may restrict its practical value and application. The
necessary data on PB word intelligibility, voice quality, and
complexity were obtained largely from the present studies, particularly
for the systems actually tested, but complementary information was
also obtained from studies that have been reported previously and
from other sources.

*  The relation between bandwidth for analog operation and
information rate for digital operation is not invariant, and depends
upon the manner in which the analog signal is coded. For many
of the compression systems, the analog signals are sampled
periodically at the Nyquist rate of about two times the analog
bandwidth, and the amplitudes of the quantized samples are
specified by three to five bits. This rule was used to determine the
corresponding ranges of bandwidth and information rate in Table
4.1-1.

The discussion to be given in the following sections will indicate
that some speech compression techniques are now ready for equipment
development, in particular the semi-vocoder, the spectrum sampling
scheme, and the channel vocoder, while other techniques are in need
of more research. Research on the semi-vocoder and channel vocoder
techniques should, however, continue with a view to obtaining further
improvements in performance. Those techniques providing an
intermediate degree of compression (Groups B and C) are, in general, the
most promising for the immediate future, although techniques
offering more compression will be improved and may become more attractive
at a later time. On the immediate horizon is the spectrum matching
and coding procedure, which may, through the use of rather complex
terminal equipment, achieve appreciable compression with performance
comparable to that of the channel vocoder. Efforts for equipment
development of particular high-compression systems such as the
formant vocoder are still somewhat premature, considering the number of
unresolved theoretical questions associated with these relatively new
approaches.

Table 4.1-1.

Presently Available Speech Compression Techniques
Grouped According to Degree of Compression Achieved

4.2  Group A: 18,000 - 12,000 bits/sec (2,000 - 1,500 cps)

Band-pass filtering. This is perhaps the simplest approach to the
problem of speech compression, and one which has been studied in
great detail. Experiments with an optimally-centered 2,000 cps
wide band-pass filter suggest that PB-word intelligibility scores
of 85-90% are readily obtainable, although the voice quality is
slightly inferior to that of a conventional telephone channel.

Because of the extremely low compression that can be achieved by
band-pass filtering, the commercial usefulness of this technique
is obviously very limited. The technique is of some value, however,
for comparison purposes. The efficiency of another compression
technique may be evaluated in terms of the bandwidth it requires
when compared to the bandwidth of an equally intelligible system
consisting of a single, optimally-centered band-pass filter.

Amplitude clipping. Amplitude clipping refers to a process whereby
the input speech waveform is amplified linearly up to a specified
amplitude level; beyond this level the output signal does not
increase with further increases in the input signal. A modest amount
of amplitude clipping has little effect on the intelligibility of
speech. The technique is employed mainly to extend the effective
range of radio-telephone transmitters. Infinite clipping, which
reduces the speech signal to a rectangular waveform, gives a PB word
intelligibility near dO^ and a rather unpleasant, harsh voice quality.
Differentiation of the speech signal before clipping improves the
intelligibility, especially in the presence of noise, and integration
after clipping improves the quality somewhat, but integration before
clipping lowers the intelligibility drastically. Licklider has
determined that amplitude-dichotomized, time-quantized speech waves
are reasonably intelligible for information rates above 8,000 -
10,000 bits per sec. Compression systems based on amplitude clipping
could therefore be categorized either in Group A or Group B.
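
The clipping operations described above can be sketched in a few
lines of Python operating on a list of sample values. This is an
idealized digital illustration rather than a model of any particular
circuit; the function names, the gain and the ceiling are
assumptions. For amplitude-dichotomized, time-quantized speech each
sample carries one bit, so the information rate in bits/sec equals
the number of samples per second.

```python
def peak_clip(samples, gain, ceiling):
    """Amplify linearly, but hold the output at +/- ceiling beyond that level."""
    return [max(-ceiling, min(ceiling, gain * x)) for x in samples]


def infinite_clip(samples):
    """Reduce the waveform to a rectangular wave that keeps only the sign."""
    return [1.0 if x >= 0 else -1.0 for x in samples]


def differentiate(samples):
    """First difference, a simple stand-in for differentiation before clipping."""
    return [b - a for a, b in zip(samples, samples[1:])]


def dichotomized_bit_rate(samples_per_second):
    """One bit per time-quantized sample of an infinitely clipped wave."""
    return samples_per_second * 1


wave = [0.1, 0.5, -2.0, -0.25, 0.0]
print(peak_clip(wave, gain=2.0, ceiling=0.8))
print(infinite_clip(wave))
print(infinite_clip(differentiate(wave)))
print(dichotomized_bit_rate(8000))
```

Clipping the differentiated wave, as in the third call, marks the
points where the slope of the original waveform changes sign rather
than where the waveform itself crosses zero.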

Extremal coding. This is a digital scheme for speech
transmission that is related to time-quantized clipped speech. The
primary advantages of this technique are a telephone-like voice
quality and a relatively high PB word intelligibility, approaching
90%. The technique is complex, however, and because no working
model of a system has been constructed, all operations have been
simulated on a digital computer. The extreme amplitudes of the
speech wave and the time intervals between these extremes must be
extracted, and a buffer memory is required to convert the randomly
occurring information about the extremes to a uniform rate before
transmission. There exists a possible future potential for this
technique in communication links with large terminal installations
having computing and PCM facilities. Equipment development may
commence in this area without further research effort.
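
The extraction step described above, finding the extreme amplitudes
of the wave and the time intervals between them, can be sketched as
follows. This is only an illustration of the idea on a sampled
waveform; the function name and the representation of each extreme
as an (interval, amplitude) pair are assumptions, and the buffering
and coding stages of an actual system are not shown.

```python
def extract_extremes(samples):
    """Return (interval_in_samples, amplitude) for each local maximum or minimum.

    The interval is measured from the previous extreme, or from the
    start of the record for the first extreme.
    """
    extremes = []
    last_index = 0
    for i in range(1, len(samples) - 1):
        prev, cur, nxt = samples[i - 1], samples[i], samples[i + 1]
        is_maximum = cur > prev and cur >= nxt
        is_minimum = cur < prev and cur <= nxt
        if is_maximum or is_minimum:
            extremes.append((i - last_index, cur))
            last_index = i
    return extremes


wave = [0, 2, 1, -3, -1, 4, 0]
print(extract_extremes(wave))
```

Because the intervals between extremes vary with the signal, the
pairs are produced at an irregular rate, which is why the text calls
for a buffer memory ahead of a uniform-rate transmission channel.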

Time compression. This procedure involves the periodic extraction
of time samples from the speech signal. These samples are divided
in frequency, abutted in time, and stored for later reproduction
at a speed appropriate for restoration of the speech. With a
properly chosen sampling period, a moderate amount of time
compression can be achieved with a PB-word intelligibility somewhat better
than 80%. The voice quality is probably superior to that of clipped
speech, although this technique is inherently more complex. Various
schemes of time compression have been investigated in detail, and
it appears that the commercial possibilities for the technique in
obtaining a bandwidth compression of a factor of two or more are
very small. Further equipment development and research are
therefore not recommended.

Semi-vocoder (base-band or voice-excited vocoder). The semi-vocoder
represents a compression technique which clearly has possible future
potential. PB-word scores near 90% and a good, telephone-like voice
quality are the main advantages of this technique. Semi-vocoders
can be designed to use available communication channels having
bandwidths anywhere in the range 800-3000 cps, depending on the
intelligibility and voice quality required. For example, "hi-fi"
voice transmission has been achieved with a semi-vocoder that uses
a conventional telephone channel. The filter banks required at
the analysis and synthesis terminals of the semi-vocoder constitute
a limitation on this technique, since it is difficult to build
light and compact terminal equipment when a number of filters must
be included. Although the principles of the semi-vocoder are well
established, it is suggested that further research could profitably
be carried out in order to arrive at optimum filter arrangements,
optimum filter characteristics, and optimum procedures for deriving
an excitation signal from the baseband. Thus, for a given total
bandwidth for the transmission signals and for given channel
characteristics, each of these features could be adjusted to
maximize the intelligibility and voice quality. Since the modification
of any one feature of the equipment would probably change the
intelligibility of only a small number of speech sounds, short
nonsense syllable intelligibility tests of the type described in
Section 2.4 could be used during this research phase. Some of this
work has already been carried out or is in progress, but it is
suggested that further studies could lead to improved semi-vocoder
performance. At the same time, it is clear that the semi-vocoder
technique is sufficiently well advanced that equipment development
may be scheduled concurrent with the research.

Spectrum sampling. This is a relatively simple technique involving
band-pass filtering of two or more radio-frequency carriers which
are amplitude-modulated by the speech signal. The filter outputs
are demodulated and summed to reproduce selected regions of the
original speech spectrum. With a system having three 650 cps wide
filters centered about optimum frequencies (500, 1500, 2500 cps),
PB word scores near 90% have been obtained. With an 8-filter
system (1500 cps bandwidth*) PB word scores of 90-95% are possible.
In general it may be said that the nominal bandwidth required for
spectrum sampling with an optimum number and location of filters
is approximately equal to one-half of the bandwidth of an optimally
centered band-pass filter providing the same level of
intelligibility. The spectrum sampling technique has potential, therefore,
in applications where only a modest amount of compression is
required, and where simple, compact and light terminal equipment is
a necessity. It cannot compete with the semi-vocoder, however, in
situations where more complex terminal equipment can be allowed.

4.3  Group B: 12,000 - 5,000 bits/sec (1,500 - 600 cps)

The Semi-vocoder. The semi-vocoder is mentioned again in this
group because it is essentially capable of more compression than
is represented by a bandwidth of 1500 cps. The semi-vocoder that
was tested required an analog bandwidth of 900 cps for transmission
of both the base-band and the vocoder channel information, including
guard bands.

Comments regarding the potential of the semi-vocoder and the
research required to improve the semi-vocoder performance have been
given in Section 4.2.

Spectrum sampling. The spectrum sampling system which was tested
uses three filters and has a nominal bandwidth of 800 cps. PB-word
scores approaching 70% were obtained for male speakers, and the
voice quality was inferior to that of the tested semi-vocoder.
More extensive experiments have indicated that optimum performance
(PB-word scores near 85%) is reached with seven or more filters
for a nominal bandwidth of about 1000 cps. In the hope of being
able to provide more compression without loss in intelligibility,
a modification of the inherently simple spectrum sampling
technique has been proposed. This modification involves moving the
center frequencies of one or more of the sampling filters according
to the short-time energy distribution in the speech spectrum. Some
research needs to be done to determine the effectiveness of this
operation. If a production model of the basic technique with fixed
filters were contemplated, however, equipment development could
commence without further research.

*  Measured at the 30 db down points.

Tasairoff-Daguet. The Tasaroff-Daguet system is described and
discussed in classified Sections 6.1 through 6.3.

4.4  Group C: 5,000 - 2,000 bits/sec (600 - 250 cps)

Channel vocoder. The sixteen-channel vocoder provides a PB-word
intelligibility approaching 85% and a voice quality comparable to
that obtained for the spectrum sampling system described in Group
B. The voice quality of this channel vocoder is not as good as
that of the semi-vocoder because of errors in pitch extraction at
the sending terminal, errors in voiced-unvoiced switching at the
synthesis terminal, and less accurate reproduction of
low-frequency components of the signal. The channel vocoder is inherently
more complex than the semi-vocoder, since, in addition to the
filters, it requires equipment for coding the excitation signal.

It appears that the past and present research on this technique
will probably lead to further improvements in performance. Several
problems are now being investigated, and these and others require
further research: (a) Studies are needed to determine what features
of the excitation signal should be reproduced at the synthesizer
in order to obtain maximum intelligibility and highest voice quality.
(b) Based on information obtained in (a), procedures must be devised
for extracting appropriate features of the excitation signal. (c)
Further studies are needed to determine the arrangement of filters and
filter characteristics that will give optimum performance for a given
total channel bandwidth. Relative to (c), a recent review of speech
compression techniques makes the following statement:


Recent developments have indicated that the steep,
flat-bottomed band-pass filter characteristics commonly found
in early vocoders are not necessary, and, on the contrary,
that the quality of the synthesized speech may be improved
by introducing simple, narrow-band tuned circuits. The
band-pass filters at the analysis end require somewhat
greater selectivity than those at the synthesizer. Other
experiments have indicated that a dynamic expansion of
channel signals, exaggerating differences in spectral
levels, can improve the quality of conventional vocoders.


This research should be carried out with a view to determining, in
a precise way, the types of distortion that are being imposed on
the speech signal and the effect of these distortions on the
perception of the various features that contribute to vowel and
consonant intelligibility and to voice quality. It will then be
possible to make modifications in the equipment to minimize the
perceptual consequences of the distortions. Although research of
this type is currently in progress and further research is proposed,
it is suggested that the technique as it stands has sufficient
merit to warrant equipment development. Such development has, of
course, already been carried out to some extent, but as research
on vocoders progresses the results of the research should be
applied to the development of new equipment.

Cross- and Auto-correlation Vocoders. These systems are, in a
sense, time-domain versions of the channel vocoder. In both
schemes the fundamental frequency must be extracted and the
voiced-voiceless distinction must be made. The schemes differ from the
channel vocoder in that the transmitted signals specify the
waveform in each period of the fundamental rather than the gross
spectrum. With further research it is possible that the performance
of the cross- and auto-correlation vocoders can reach, but probably
not exceed, the maximum performance of conventional channel vocoders.
An important advantage of these time-domain schemes over the channel
vocoder is that banks of filters are not required either in the
analyzer or in the synthesizer, and hence the equipment is smaller and
less complex.

4.5  Group D: 2,000 - 800 bits/sec (250 - 100 cps)

Formant Vocoder. This vocoder represents a technique which is
considerably more involved and not yet as refined as the technique of
the conventional channel vocoder. The performance of the formant
vocoder from the point of view of both intelligibility and voice
quality is below that which would be considered acceptable for any
but the most restricted applications. Further research on the
nature of speech and further progress in techniques of speech
analysis and synthesis are required before development of a
complete system suitable for practical use is commenced. The
experiments performed in connection with the present study, as well as
other studies reported in the literature, indicate several areas
where research is needed. Again, some of this research is already
in progress, but more long-range research effort is needed before
a system of the formant-tracking type can be considered to have
practical value.

Among the topics for research are the following: (a) Procedures
for extraction of formant frequencies during non-nasal voiced
sounds require further study. Part of this study involves the
establishment of suitable criteria for the accuracy of formant
tracking, particularly during rapid changes in formant
frequencies at vowel boundaries. (b) Methods for synthesizing and
for extracting appropriate parametric representations of speech
sounds that are generated by vocal-tract excitation at points
other than the glottis or that are characterized by nasal
consonants need to be examined in detail.

Inability of the analyzer to track rapid formant transitions and
to provide a proper specification of consonant spectra seems to
be a basic limitation of present formant-tracking systems.
Superimposed upon these errors are deteriorations in intelligibility and
voice quality arising from inadequate reproduction of the voice
excitation. As noted previously, this difficulty is encountered in
any system that requires tracking of the fundamental frequency.

As a result of concentrated research effort it may eventually be
possible to realize a high-performance formant vocoder. It may
happen, however, that high performance can only be achieved at
the expense of more complex equipment, possibly requiring some
delay in order to perform the required operations on the speech
signal.

Peak-picker [44]. The peak-picker may be considered to be a
modification of the conventional channel vocoder, although it has some
features of the formant vocoder. At every instant of time special
peak-detection circuitry determines those few filters in a filter
bank which have the greatest outputs. Signals specifying these
filters and the magnitudes of their outputs, together with the
usual source information, are transmitted to the receiver. At
the synthesizer only those modulators which correspond to the
selected filters are in operation, and thus peak-picking allows an
additional reduction in bandwidth over the conventional channel
vocoder. The intelligibility and voice quality of speech
processed in this manner are somewhat inferior when compared to the
test results obtained for a well-engineered channel vocoder. It
is probable, therefore, that the future potential of the
peak-picker, at least in its original form, is not great.
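
The selection step of the peak-picker can be sketched in Python as
follows. The sketch treats one instant of time, represented by a
list of filter-bank output magnitudes. The function names, the
six-channel example and the choice of three selected filters are
assumptions made for the illustration; the text says only that "those
few filters" with the greatest outputs are chosen.

```python
def pick_peaks(channel_outputs, count=3):
    """Return (filter_index, magnitude) for the `count` largest outputs."""
    ranked = sorted(
        range(len(channel_outputs)),
        key=lambda i: channel_outputs[i],
        reverse=True,
    )
    chosen = sorted(ranked[:count])
    return [(i, channel_outputs[i]) for i in chosen]


def synthesizer_gains(picked, num_channels):
    """Gains at the receiver: only the modulators of selected filters operate."""
    gains = [0.0] * num_channels
    for index, magnitude in picked:
        gains[index] = magnitude
    return gains


outputs = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]  # one frame of filter outputs
picked = pick_peaks(outputs, count=3)
print(picked)
print(synthesizer_gains(picked, len(outputs)))
```

Only the indices and magnitudes of the selected filters need to be
sent, instead of one signal per channel, which is the source of the
additional bandwidth reduction noted above.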

4.6  Group E: Below 800 bits/sec (Under 100 cps)

Spectrum Pattern Matching and Coding. C. P. Smith has described
a technique whereby the channel capacity required for the channel
vocoder can be reduced through a spectrum matching and coding
process [48, 49]. This technique involves the quantization of
incoming signals in frequency and time, and the coding of successive
spectral samples in terms of a limited catalog of stored speech
patterns. This pattern-matching technique requires an estimated
information rate of only 400 to 800 bits/sec. Although the
technique has not yet been tested, it is expected to give a speech
output comparable to that of the conventional digitized channel
vocoder. A disadvantage of the method is that a large, rapid-access
memory is necessary; hence application is restricted to
communication links with elaborate terminal facilities. In view
of the substantial compression achieved, however, it is clear that
this technique has considerable potential. Further research and
equipment development are being carried out by the Air Force
Cambridge Research Laboratories and their contractors.
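The relation between catalog size and information rate can be
illustrated with a short sketch. All numbers here are assumptions
chosen for illustration: a catalog of 1024 stored patterns and frame
rates of 40 and 80 spectral samples per second are hypothetical values
that happen to span the 400 to 800 bits/sec range quoted above; the
actual catalog size and sampling rate of Smith's technique are not
given in this report. The sketch also counts only the pattern index and
ignores the source (pitch and voicing) information.

```python
def nearest_pattern(frame, catalog):
    """Index of the stored pattern with the smallest squared distance to frame."""
    def squared_distance(pattern):
        return sum((a - b) ** 2 for a, b in zip(frame, pattern))
    return min(range(len(catalog)), key=lambda i: squared_distance(catalog[i]))


def index_bit_rate(catalog_size, frames_per_sec):
    """Bits/sec needed to send one catalog index per spectral sample."""
    bits_per_frame = (catalog_size - 1).bit_length()   # ceil(log2(catalog_size))
    return bits_per_frame * frames_per_sec


# Toy three-pattern catalog (hypothetical), one pattern per row.
catalog = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
print(nearest_pattern([0.1, 0.8, 0.2], catalog))   # 1

# Assumed catalog size and frame rates, for illustration only.
print(index_bit_rate(1024, 40))   # 400
print(index_bit_rate(1024, 80))   # 800
```

The sketch also makes the stated disadvantage concrete: every incoming
spectral sample must be compared against the whole catalog, so the
catalog has to be held in a memory that can be searched at the frame
rate.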

4.7 Current Research Relevant to the Development of Speech
Compression Systems

A number of research studies currently in progress are contributing
to our knowledge of the speech communication process. It is
appropriate in this report to speculate on the potential application
of this research to the future development of practical speech
compression systems. It is suggested that speech compression systems
that are characterized by low information rates (less than, say,
2000 bits/sec) cannot be developed to the point where the speech
quality approaches that of conventional systems until some of these
research studies have yielded a better understanding of the human
speech process. Furthermore, the research should lead to techniques
for improving the performance of compression systems with higher
information rates, say in the range 2000 - 5000 bits/sec.

In the following paragraphs brief descriptions of some of the
currently active research projects are given. The topics include
studies of the nature of human generation and perception of speech
and studies of new methods of speech analysis and synthesis.

Inverse filtering. One research item that is relevant to formant
vocoder systems is the study of a speech analysis procedure known
as inverse filtering [28]. The procedure can be considered to
be an application of a general analysis technique that has been
called "analysis by synthesis." The method can be used to obtain
rather precise measures of the frequency positions of the
poles and zeros of the vocal-tract transfer function during vowel
and consonant utterances. Procedures for extracting parameters
describing these pole and zero locations as a function of time have
an important bearing on the development of speech compression
systems, since such parameters are known to provide a compact
description of the speech signal. Basically the inverse filtering
procedure requires that a set of filters be adjusted automatically
in such a way that the transfer function of the filters is the
reciprocal of the transfer function of the vocal tract that was used
to generate the signal. The source of vocal-tract excitation then
appears at the output of the inverse filter. An alternative
procedure is to perform the operations in the frequency domain by
finding a set of pole and zero locations that yield a spectrum
that matches the speech spectrum under analysis [2].
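The reciprocal relation between the vocal-tract filter and the inverse
filter can be illustrated with a minimal discrete-time sketch. It
rests on several assumptions that are not part of the procedure
described above: the vocal tract is reduced to a single all-pole
resonance, the formant frequency and bandwidth (500 cps and 60 cps) and
the 8000 samples/sec rate are hypothetical, and the inverse filter is
simply handed the correct coefficients, whereas the real analysis
problem is to adjust them automatically.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients a1, a2 of the denominator 1 + a1*z^-1 + a2*z^-2."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    return [-2.0 * r * math.cos(theta), r * r]


def all_pole_filter(x, a):
    """Vocal-tract model: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def inverse_filter(y, a):
    """Reciprocal (all-zero) filter: e[n] = y[n] + a1*y[n-1] + a2*y[n-2]."""
    e = []
    for n, yn in enumerate(y):
        acc = yn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        e.append(acc)
    return e


# Assumed values, for illustration only.
sample_rate = 8000
a = resonator_coeffs(freq_hz=500.0, bandwidth_hz=60.0, sample_rate=sample_rate)

# Idealized excitation: one unit pulse every 80 samples (100 pulses/sec).
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
speech = all_pole_filter(excitation, a)
recovered = inverse_filter(speech, a)

worst_error = max(abs(r - e) for r, e in zip(recovered, excitation))
print(worst_error < 1e-9)   # True
```

When the inverse filter matches the resonance exactly, its output is
the excitation pulse train. A mismatch in the assumed formant
frequency or bandwidth leaves residual ringing in the output, which is
the quantity an automatic adjustment procedure would seek to minimize.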

These inverse filtering methods have not yet been realized in a
real-time situation. If real-time analysis of this type can be
achieved, then its incorporation into the analyzer of a formant
vocoder should greatly improve the performance of the system.
The analysis procedure is likely to require rather complex
operations, however, and some delay or memory will probably be
required.

Nature of glottal wave. It has been reasonably well established
that the voice quality of both synthetic and natural speech is
determined largely by the nature of the glottal excitation. In
order to obtain good quality for synthetic speech, the waveform
of each glottal excitation pulse must have the proper shape, and
the proper temporal relations must exist between successive glottal
pulses during an utterance [14]. Several studies of these and other
aspects of glottal excitation are in progress, and the results of
these studies should lead to suggestions for the improvement of
voice quality of vocoder systems, including specifications for the
design of improved pitch extractors.

In one group of studies, a careful examination is being made of
the waveform of the volume velocity output of the glottis, using
inverse filtering techniques. These studies show that the
waveform has a triangular appearance, and hence the spectrum of
the glottal output is characterized by a set of zeros located
close to the jω-axis in the complex frequency plane. A more
fundamental approach is being followed by groups that are
examining in detail the mechanism of operation of the larynx
through photographic and other techniques [50, 55]. These studies
also show that the area of the glottis opening has a triangular
waveshape, but that this waveform varies with voice effort and
with fundamental frequency.

Measurements of the intervals between successive glottal pulses
have shown that a certain amount of quasi-random pulse position
modulation is superimposed on the smooth changes in inter-pulse
interval associated with varying inflection patterns [30].
Perceptual experiments with synthetic speech have demonstrated that
the voice quality is improved if such randomnesses are superimposed
on the regular inflection patterns.

Pitch extraction. Although the first pitch extractor was devised
several decades ago, effort continues to be devoted to the
development of a device that indicates the position of each glottal
pulse with a minimum number of errors. The basic problem is to devise
a procedure that operates in a satisfactory manner for a wide
range of fundamental frequencies and for many different speakers
and different voice efforts. Most of the schemes that have been
under study recently have attempted to find the location of each
glottal pulse by making a number of separate determinations of the
pulse location through a series of measurements on the waveform or
on some transformed version of the waveform, and then making a
decision on the presence or absence of a pulse by looking for a
coincidence among several of the separate determinations.
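The coincidence decision described above can be sketched as a simple
vote among detectors. The sketch below is hypothetical throughout: the
three detectors, their candidate pulse positions (in samples), the
tolerance of two samples, and the requirement of two agreeing
detectors are all assumed for illustration and do not describe any
particular pitch extractor.

```python
def coincident_pulses(estimates, tolerance=2, min_votes=2):
    """Accept a pulse where at least min_votes detectors agree within tolerance.

    estimates: one list of candidate pulse positions (samples) per detector.
    The earliest candidate of each agreeing cluster is reported.
    """
    candidates = sorted(p for detector in estimates for p in detector)
    accepted = []
    for c in candidates:
        votes = sum(
            1 for detector in estimates
            if any(abs(c - p) <= tolerance for p in detector)
        )
        is_new = not accepted or c - accepted[-1] > tolerance
        if votes >= min_votes and is_new:
            accepted.append(c)
    return accepted


# Hypothetical outputs of three separate determinations of pulse location.
detector_a = [100, 180, 260]
detector_b = [101, 140, 259]   # 140 is a spurious detection
detector_c = [99, 181, 300]    # 300 is spurious; the pulse near 260 is missed

print(coincident_pulses([detector_a, detector_b, detector_c]))
# [99, 180, 259]
```

The isolated candidates at 140 and 300 are rejected because no second
detector confirms them, while the pulse near 260 is retained even
though one detector missed it.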

It is evident that research on methods for pitch extraction should
go hand in hand with basic studies of the glottal wave in natural
speech and with studies of human pitch perception. The
psychoacoustic studies should help to establish criteria that
indicate whether a pitch extractor is operating in a satisfactory
manner.

Articulatory synthesizer. As discussed above, a limitation of the
present formant vocoder technique stems from the fact that the
synthesizer is designed primarily for the generation of vowels and
vowel-like sounds, and it is necessary to generate certain consonant
sounds by means of special circuits or through some other tour de
force. Some effort is currently being devoted to the development
of a synthesizer that is in principle capable of generating all
vowels and consonants with a single time-varying network. This
synthesizer is an analog of the acoustic tube that forms the vocal
tract between the glottis and the lips, including the nasal
cavities. Changes in vocal-tract configuration are simulated
in the synthesizer through control of the values of a set of variable
electrical elements. The analog circuit can be excited by an
electrical buzz source at one end, simulating glottal excitation,
or by an electrical noise source at some point along its length,
simulating excitation by noise that results from turbulence in the
vocal tract. Thus voiced and voiceless sounds are generated by the
same circuit simply by changing the location and spectrum of the source.

Present research is devoted to finding the nature of the control
signals that must be applied to such a synthesizer in order to
generate natural speech. Such research must necessarily include
studies of the actual vocal-tract configurations used in natural
speech and the motions of the anatomical components that give rise
to these configurations. In order to minimize the information
rates of the signals controlling the synthesizer, means must be
found for describing the articulatory activities with a relatively
small number of parameters. It is probable that the information
rates associated with the control signals for such a synthesizer
would be equal to or less than those needed to control the
synthesizer in a formant vocoder, i.e., 1000 bits/sec or less.

Articulatory analysis. If a speech compression system using an
articulatory synthesizer is contemplated, it will be necessary at
the transmitting end to extract from the speech wave a set of
control signals that describe in some approximate manner the
configurations and excitations of the vocal tract. Research toward
these objectives has hardly begun, but it is evident that a reasonably
complex set of calculations will have to be made in order to
extract the proper signals.

Acoustic properties of speech sounds. While it is known that
certain classes of speech sounds, particularly the non-nasal vowels,
can be described rather precisely yet compactly in terms of the
frequencies of the first two or three formants, comparable
procedures for the description of other classes of sounds are still
under study. Such procedures are relevant to the development of
compression systems of the formant vocoder type, since it is
desirable to devise methods for properly generating various classes
of sounds at the receiver and for extracting from the input speech
signal a set of parameters that can be used to control the
synthesizer.

Studies have shown, for example, that the spectra of fricative
consonants can be described approximately by one or two poles and
one zero [23], but methods for automatically extracting these
parameters from the speech signal have not yet been devised. Similarly,
the spectra of nasal consonants are characterized by a number of
poles and zeros, although the synthesis of consonants in this class
can be approximated by using a circuit whose transfer function has
only poles if the bandwidths of the resonances are made sufficiently
broad [41]. Further studies are necessary, however, before these and
other classes of consonants are understood sufficiently well that
they can be handled properly in a formant vocoder system.

Cues for identification of speech sounds. Over a period of years
a number of studies have been carried out to determine which features
of the acoustic speech signal constitute the principal cues used
by a listener to identify the signal. For example, experiments
have led to an approximate specification of the directions and
rates of change of the formant transitions between vowels and stop
and nasal consonants. These studies have particular significance
for the design of speech compression systems, since they
indicate the types of distortions that are likely to obscure
certain cues and thus lead to loss of intelligibility. Thus, as
the results of these perceptual studies become available, they
should suggest how present speech compression systems can be
improved and they will provide a basis for the design of future
systems.

5.2 Appendix II - Statistical Analyses

The experimental procedures used in the PB word intelligibility
and talker identification tests were arranged to permit the
determination of some overall measure of significance in the form of an
analysis of variance. In the case of the intelligibility data
a three-way analysis scheme was used. The arrangement may be
visualized in the form of a cube with the systems under test on one axis,
the talkers who read the test materials on another axis, and the
listeners (subjects) on the third axis. The talker identification
tests used a two-way analysis scheme in which the systems under test
were on one axis and the listeners on another.

Table 5.2-1 shows the analysis of variance of the intelligibility
scores obtained for seven systems, using two male talkers and eight
listeners. When the triple interaction variance term is used as the
denominator in an F ratio, none of the simple interactions reach a
magnitude that is significant at the 1% level of confidence.
Similar tests of the main effects demonstrate that the variance
attributable to Systems and to Listeners is statistically significant
(p ≤ 0.01); there is no marked effect of Talkers in this study.

Table 5.2-1

Analysis of Variance of Intelligibility Scores
Obtained for Seven Systems Using Two Talkers
and Eight Listeners

Source of Variation   Sum of Squares    d.f.    Variance        F*

Systems (S)                41,305.11       6    6,884.19    330.34**
Talkers (T)                    89.26       1       89.26      4.28
Listeners (L)                 539.99       7       77.14      3.70**
T x S                         340.55       6       56.76      2.72
T x L                         139.31       7       19.90         -
S x L                       1,295.82      42       30.85      1.48
T x S x L                     875.38      42       20.84         -

Total                      44,585.42     111

*  F = V/V_TxSxL
** p ≤ 0.01 level of confidence

The finding that a significant variance is attributable to Systems
permits an examination of the differences among the system scores.
The results of t tests in Table 5.2-2 demonstrate that almost all of
the ranked system scores differ significantly from contiguous scores.
(It is probable that the Stromberg system differs significantly from
the Tasaroff-Daguet system, at least at the p = 0.05 level of
confidence.)

Table 5.2-2

Mean Intelligibility Scores and t Tests
Arranged to Demonstrate Significant Differences Among Systems
(Seven Systems, Two Talkers and Eight Listeners)

Systems              M Score     M Diff.         t

Reference               95
                                    9          8.46*
Stromberg               86
                                    1          1.25
Philco                  85
                                    6          3.20*
Tasaroff-Daguet         79
                                   11          6.19*
Narrow Band             68
                                    7          4.93*
Hughes                  61
                                   28         13.33*
Melpar                  33

* p ≤ 0.01 level of confidence


A similar kind of analysis was made with five systems using four
talkers (two male and two female) and eight listeners. The results
of this analysis are shown in Table 5.2-3. In this analysis the
Talker x Systems interaction was significantly large and was used
to test certain main effects. This interaction indicates that the
score obtained from a given system was affected significantly by
differences among the talkers. The major sources of variation,
however, were contributed by the Systems, the Talkers, and the
Listeners (main effects).

Table 5.2-3

Analysis of Variance of Intelligibility Scores
Obtained for Five Systems Using Four Talkers
and Eight Listeners

Source of Variation   Sum of Squares    d.f.    Variance        F*

Systems (S)                30,422.75       4    7,605.68     33.62**
Talkers (T)                22,227.47       3    7,409.15     32.75**
Listeners (L)               1,225.32       7      175.04     13.62**
T x S                       2,714.90      12      226.24     17.61**
T x L                         456.18      21       21.72      1.69
S x L                         622.30      28       22.22      1.73
T x S x L                   1,079.45      84       12.85         -

Total                      58,748.37     159

*  F = V_TxS/V_TxSxL, V_TxL/V_TxSxL, V_SxL/V_TxSxL, V_L/V_TxSxL,
       V_S/V_TxS, V_T/V_TxS
** p ≤ 0.01 level of confidence
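The choice of denominators in Table 5.2-3 can be verified in the same
way. In the sketch below the Systems and Talkers main effects are
tested against the Talker x Systems interaction, and the remaining
terms against the triple interaction, following the footnote to the
table. The dictionary layout and names are arbitrary, and agreement
with the tabulated values is expected only to within rounding.

```python
# (sum of squares, degrees of freedom) as tabulated in Table 5.2-3
table = {
    "S":     (30422.75, 4),
    "T":     (22227.47, 3),
    "L":     (1225.32, 7),
    "TxS":   (2714.90, 12),
    "TxL":   (456.18, 21),
    "SxL":   (622.30, 28),
    "TxSxL": (1079.45, 84),
}

# Denominator (error term) used for each F ratio, per the table footnote.
denominator = {
    "S": "TxS", "T": "TxS",
    "L": "TxSxL", "TxS": "TxSxL", "TxL": "TxSxL", "SxL": "TxSxL",
}

variance = {name: ss / df for name, (ss, df) in table.items()}
f_ratio = {name: variance[name] / variance[den]
           for name, den in denominator.items()}

for name, f in f_ratio.items():
    print(f"F({name}) = V_{name}/V_{denominator[name]} = {f:.2f}")
```

Testing Systems and Talkers against the larger T x S variance gives a
more conservative F than the triple interaction would: for Systems the
ratio is about 34 instead of several hundred.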


The significant differences among the five systems are specified
in Table 5.2-4, which supports the findings of the earlier analysis
(Table 5.2-2).

Table 5.2-4

Mean Intelligibility Scores and t Tests Arranged to
Demonstrate Significant Differences Among Systems
(Five Systems, Four Talkers and Eight Listeners.)

Systems              M score     M diff          t

Reference               89
                                   17          5.61*
Philco                  72
                                    1          1.19
Stromberg               71
                                   11         10.14*
Narrow Band             60
                                   13          9.20*
Hughes                  47

* p ≤ 0.01 level of confidence

Similar analyses have been accomplished for the talker
identification data obtained with Quartets I and II. In this case,
however, the scores were arranged to test the contributions to the
total variance of the Systems and the Listeners. The results of
these two-way analyses of variance are presented in Tables 5.2-5
and 5.2-7. For both quartets there were statistically significant
differences among the talker identification scores obtained from
the systems under test, but the ranges of scores were smaller
than those obtained in the intelligibility tests. The mean talker
identification scores obtained for Quartets I and II, along with
the results of t tests between adjacent scores, are shown in
Tables 5.2-6 and 5.2-8, respectively.

Table 5.2-5

Analysis of Variance of Talker Identification Scores
Obtained with Quartet I, using Seven Systems and 29 Listeners

Source of Variation   Sum of Squares    d.f.    Variance        F*

Systems                       551.77       6       91.96     17.96**
Listeners                     462.40      28       16.51      3.23
Remainder                     859.94     168        5.12         -

Total                       1,874.11     202

*  F = V/V_Remainder
** p < 0.01 level of confidence

Table 5.2-6

Mean Talker Identification Scores and t Tests for Quartet I
Arranged to Demonstrate Significant Differences
Among Systems (Seven Systems and 29 Listeners.)

Systems              M score     M diff          t

Reference               54
                                    9          2.69*
Tasaroff-Daguet         45
                                    0            --
Narrow Band             45
                                    2
Philco                  43
                                    2
Stromberg               41
                                    8          3.23*
Hughes                  33
                                    6          1.96*
Melpar                  27

* p < 0.05 level of confidence


Table 5.2-7

Analysis of Variance of Talker Identification Scores Obtained
with Quartet II, using Seven Systems and 30 Listeners

Source of Variation   Sum of Squares    d.f.    Variance        F*

Systems                       731.63       6      121.94     19.05**
Listeners                     505.42      29       17.43      2.72
Remainder                   1,113.20     174        6.40        --

Total                       2,350.25     209

** p < 0.01 level of confidence


Table 5.2-8

Mean Talker Identification Scores and t Tests for
Quartet II Arranged to Demonstrate Significant Differences
Among Systems (Seven Systems and 30 Listeners.)

System               M score     M diff          t

Reference               59
                                    5          1.22
Stromberg               54
                                    6          2.33*
Narrow Band             48
                                    4          1.46
Tasaroff-Daguet         44
                                    7          2.69*
Melpar                  37
                                    2
Philco                  35
                                    4
Hughes                  31

* p < 0.05 level of confidence